RE: Using Punicode for host names in IRI -> URI translation; phishing; comparison from Shawn Steele on 2009-11-17 (public-iri@w3.org from November 2009)

From: Shawn Steele <Shawn.Steele@microsoft.com>
Date: Tue, 17 Nov 2009 21:52:24 +0000
To: Larry Masinter <masinter@adobe.com>, "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>, Peter Constable <petercon@microsoft.com>
CC: Pete Resnick <presnick@qualcomm.com>, Ted Hardie <ted.ietf@gmail.com>
Message-ID: <E14011F8737B524BB564B05FF748464A04452EBF@TK5EX14MBXC139.redmond.corp.microsoft.>
Sorry, stuffy head, and new to this part of the discussion :)

1) I'd really like to avoid punycode "leaking" everywhere :)

I question the usefulness of a URI whose host name is restricted to ASCII.  Surely % encoding isn't expected for the host name by legacy apps?  And punycode "would work" if someone had to stick it into a URI, but legacy apps wouldn't know how to do the conversion, so supporting an IRI would be better.  So instead of shoehorning Punycode into URIs, I'd rather see apps use an IRI.  (Then you don't need % encoding either.)  As mentioned, my head is really foggy, so maybe I'm not giving enough consideration to the scenarios.

2) Re: Latin entry of Cyrillic, I'm not sure how much control the publisher of the URL has.  If I put a bus ad for my hotel in Moscow, then presumably I'd expect Cyrillic users.  However, a tourist may accidentally transcribe it to Latin.  I think that amount of error is probably reasonable and fairly unavoidable.  (And hopefully rare; after all, not every Cyrillic word looks like Latin.)

3) I think comparison/mapping can be a different RFC or BCP.  As the IDN & UTS46 work demonstrates, there's some passion that mapping/comparison be very consistent, however, so I think it should be formalized, and required if used.  (So is a BCP strong enough, or would it have to be an RFC?)  UTS46 only considers IDN, though, and doesn't look further to the rest of an IRI.  The advantage of a UTS is that when Unicode code points are added, Unicode can update any mappings.  Since the UTC are experts in Unicode, that makes some sense.

An interesting problem with mapping is the behavior of the global part of the information vs the local part.  By this I mean that some concepts like DNS are global and must act consistently whether I'm a US user in a Turkish airport or some other combination that might impact my mapping expectations.  So there must be an international/lingual standard mapping that everyone uses, even if it isn't perfect for all languages.

Local paths, however, don't necessarily have the same technical restriction.  Once I get to a Turkish server, could the actual file path use Turkish i casing for the i?  EAI (mail) allows the local part to have mappings determined locally by the server, which probably makes sense for mail.  However, that prevents a remote client from being able to determine equality reliably, so perhaps that's not acceptable for an IRI.

P.S.:  I'm really curious about BIDI for IRIs :)

-Shawn

________________________________
From: Larry Masinter [masinter@adobe.com]
Sent: Tuesday, November 17, 2009 12:30 PM
To: Shawn Steele; PUBLIC-IRI@W3.ORG; Peter Constable
Cc: Pete Resnick; Ted Hardie
Subject: RE: Using Punicode for host names in IRI -> URI translation; phishing; comparison

1)    I don't think it's practical to say that URIs must be mapped to punycode.  People'll use Unicode anyway.  That's pretty much been proven with IDN.

I’m afraid my “strawman” proposal was pretty terse and I was trying to make a more subtle point. I try to be explicit about “URI” and “IRI”. A “URI” is ASCII only. An “IRI” uses Unicode. Yes, people will use Unicode and write IRIs.  ( “IRI” is more or less what HTML5 currently calls a “URL” ).

The only question is how to handle the translation if you have an IRI (HTML5 URL) and want to actually turn it into a URI (ASCII-only).  For most of the parts of a URL/IRI, the translation is to use %xx percent-encoding of UTF8.

But, just for the host name, and to avoid sending %xx percent-encoded host names to DNS systems which don’t know about them, the proposal is: Always use Punycode for the host name when going from IRI to URI.
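
As an illustration only, the translation described above might be sketched in Python as follows. The helper name iri_to_uri and the set of characters left unescaped are assumptions of this sketch, not part of the proposal; it relies on Python's built-in "idna" codec (IDNA 2003) for the host name and ignores any userinfo component for brevity.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched in path/query/fragment. "%" is kept so existing
# escapes are not encoded twice. This set is an assumption of the sketch.
_SAFE = "/?%:@!$&'()*+,;=~-._"


def iri_to_uri(iri: str) -> str:
    """Hypothetical helper: translate an IRI into an ASCII-only URI."""
    parts = urlsplit(iri)
    host = parts.hostname or ""
    # Host name: Punycode ("xn--") labels via the built-in IDNA codec.
    netloc = host.encode("idna").decode("ascii") if host else ""
    if parts.port is not None:
        netloc += f":{parts.port}"
    # Everything else: %xx percent-encoding of the UTF-8 bytes.
    path = quote(parts.path, safe=_SAFE)
    query = quote(parts.query, safe=_SAFE)
    fragment = quote(parts.fragment, safe=_SAFE)
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri("http://bücher.example/straße?q=ä"))
# http://xn--bcher-kva.example/stra%C3%9Fe?q=%C3%A4
```

Only the host name goes through Punycode here; the path and query keep their UTF-8 bytes in percent-encoded form, which is the split the proposal describes.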

There was some concern about doing that, because there are cases where host names *aren’t* resolved using DNS at all, but by, well, WINS for example. What should they do if they see a Punycode host name? My strawman is that if you see something with Punycode in the host name field (or %xx for that matter) you should turn it back into Unicode.
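
A minimal sketch of that reverse step, again in Python and again only illustrative: the function name host_to_unicode is hypothetical, and the built-in "idna" codec stands in for whatever conversion a real resolver would use.

```python
from urllib.parse import unquote


def host_to_unicode(host: str) -> str:
    """Hypothetical helper: recover the Unicode form of a URI host name."""
    # Undo %xx percent-encoding of UTF-8, if any was used.
    host = unquote(host)
    try:
        # Decode "xn--" (Punycode) labels; plain ASCII labels pass through.
        return host.encode("ascii").decode("idna")
    except UnicodeError:
        # Already non-ASCII (or not valid Punycode): leave it as it is.
        return host


print(host_to_unicode("xn--bcher-kva.example"))   # bücher.example
print(host_to_unicode("b%C3%BCcher.example"))     # bücher.example
```

A non-DNS resolver could then hand the resulting Unicode name to its own lookup mechanism.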


2)    I think that A & B are reasonable approaches to spoofing concerns.  I'm not sure I agree so much about unambiguously-enterable though.  In most cases someone will probably type it in a form that's expected, perhaps if some mappings are applied, but it is probably unavoidable that some Latin user will see a Cyrillic URL and mistakenly try to enter it in Latin.  I can't see how a specification can solve that problem.

Well, yes, the publisher of a (human readable form of a) Cyrillic URL has the responsibility of making sure that users who see the URL won’t mistype it and enter it as Latin.

3)    The statement that Unicode can't be compared like ASCII because ASCII has a canonical form, but Unicode has various mappings, is misleading. The traditional "single canonical form" for plain-ASCII mapping doesn't work for everyone (e.g., Turkish i).

I’m not sure what I said that was wrong; I agree it was terse. Mainly I was summarizing what was a lengthy exposition in the IAB presentation, though.

 However, the "ascii lower case" is assumed to work, whether appropriate or not.  So extending a standardized mapping to Unicode wouldn't be "any worse" than the ASCII behavior we have now.  It's just more obvious because there are more code points and we're "used" to ASCII, so we ignore its limitations.
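
To illustrate the Turkish i point, here is a small Python sketch. Both function names are hypothetical, and turkish_lower is a deliberately simplified stand-in for real locale-aware casing, handling only the dotted/dotless i pair.

```python
def ascii_lower(s: str) -> str:
    """Lower-case only A-Z, as ASCII-based comparisons typically do."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


def turkish_lower(s: str) -> str:
    """Simplified Turkish casing: I -> dotless i, dotted capital I -> i."""
    # U+0130 is LATIN CAPITAL LETTER I WITH DOT ABOVE,
    # U+0131 is LATIN SMALL LETTER DOTLESS I.
    return s.replace("\u0130", "i").replace("I", "\u0131").lower()


word = "ISTANBUL"
print(ascii_lower(word))    # istanbul
print(turkish_lower(word))  # first letter becomes dotless i (U+0131)
print(ascii_lower(word) == turkish_lower(word))  # False
```

The two mappings disagree on a purely ASCII input, which is the sense in which the ASCII rule is already a convention rather than something correct for every language.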

Exact comparisons might be best, but a BCP or UTR that describes good practices might be helpful.  If exact comparisons are required, it should be enforced in ASCII as well.  (My logic is that if it isn't "good enough" for ASCII users, then Unicode users probably would also have problems).

I’m not sure if you’re agreeing to the separation of the “comparison” BCP?

Larry

________________________________
From: public-iri-request@w3.org [public-iri-request@w3.org] on behalf of Larry Masinter [masinter@adobe.com]
Sent: Tuesday, November 17, 2009 10:56 AM
To: PUBLIC-IRI@W3.ORG
Cc: Pete Resnick; Ted Hardie
Subject: Using Punicode for host names in IRI -> URI translation; phishing; comparison
These are “strawman” proposals in response to the IAB talk at the IETF meeting last week: knock down if you can.
----


  1.  Punycode for host names vs. non-public-DNS resolution
A number of the concerns about using Punycode for domain names when doing IRI -> URI translation seem to have come from the fact that there are widely deployed private networks which don’t use Punycode at all, but rather send UTF8 directly to DNS, which the DNS protocol allows.

However, I think for the use case of URI, we could take the position that URIs are really intended to be “UNIFORM” resource identifiers, whose primary use is for communication over the world-wide web, and that that use case should predominate, and that, for that reason, IRI -> URI translation MUST use Punycode for host names.  We should then note that private environments with additional mappings may need to deploy software that


  *   uses the IRI form directly (i.e., doesn’t translate IRI -> URI first)
  *   translates Punycode host names back into UTF8 for sending to locally specified host name mappings (i.e., undoes the IRI -> URI translation)
  *   provides alternative registration or lookup services for the Punycode version of host names

-----


  2.  Spoofing

Secondly, there are a number of concerns raised about spoofing. Of course, spoofing is an issue with just ASCII too, example.com vs example.corn being difficult to distinguish (never mind example.C0M).

The observation is that there are many ways in which names can be formed for which there is NO visible distinction between what are separate Unicode encodings.
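
A short Python sketch of that observation: two host names that render almost identically but are different code point sequences. The helper scripts_used is hypothetical and is only a crude heuristic based on Unicode character names, not a real confusable or mixed-script check.

```python
import unicodedata

latin_name = "example.com"
# Same string, except the "a" is U+0430 CYRILLIC SMALL LETTER A.
mixed_name = "ex\u0430mple.com"


def scripts_used(s: str) -> set:
    """Crude heuristic: first word of each letter's Unicode name."""
    return {unicodedata.name(c).split()[0] for c in s if c.isalpha()}


print(latin_name == mixed_name)        # False
print(unicodedata.name("a"))           # LATIN SMALL LETTER A
print(unicodedata.name("\u0430"))      # CYRILLIC SMALL LETTER A
print(sorted(scripts_used(mixed_name)))  # ['CYRILLIC', 'LATIN']
```

The strings compare as different even though a reader may not be able to tell them apart on screen or in print.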

The main ways I can think of to address these are:


  A.  Visual validation of URIs and IRIs is basically *NOT EFFECTIVE* and that user agents *SHOULD NOT USE* visual validation as the primary way of preventing spoofing. Other methods for protecting against phishing *MUST* be used.   I think we can point to some of the techniques that browsers currently already deploy as alternatives, without making them normative.
  B.  Anyone who prints an IRI on the side of a bus or a matchbook cover has the responsibility of making sure that what they print can be typed in a way that leads to an unambiguous result.  Currently this advice only applies to ASCII-only URIs, and the extension to other non-ASCII URIs depends on infrastructure that is NOT currently part of, or mandated by, or appropriate for, the IRI specification.   Unfortunately, the implementation advice on how to generate an unambiguously-enterable IRI depends on technology deployment which HAS NOT YET HAPPENED, and nothing we can specify in the IETF will make it happen sooner.  We can give some advice that will mitigate a few of the problems, but so few that making that advice normative isn’t actually helpful.
------

  3.  Comparison

(Related to, but different from, spoofing.) There are a number of examples in the IAB presentation of cases where comparison of ASCII-only identifiers can’t readily be extended to comparison of Unicode-extended identifiers, because of the multiple representations and lack of a single canonical form (such as with ASCII where case-insensitive comparison => lower case canonical form).

I think with the case of URIs and IRIs that the only comparison that should be normative is “exact string compare”, and that any other comparison, normalization or other advice should be scheme specific. We could take the entire comparison section of the IRI document and separate it out into its own BCP.
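
To make the difference concrete, here is a small Python sketch contrasting “exact string compare” with one possible non-normative comparison. The function names are hypothetical, and NFC is chosen only as an example of a normalization that a scheme or a separate BCP might recommend.

```python
import unicodedata


def exact_equal(a: str, b: str) -> bool:
    """Code point by code point comparison, with no mapping applied."""
    return a == b


def nfc_equal(a: str, b: str) -> bool:
    """Compare after Unicode NFC normalization (an illustrative choice)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "http://example.com/caf\u00e9"     # e-acute as one code point
decomposed = "http://example.com/cafe\u0301"  # "e" + combining acute accent

print(exact_equal(composed, decomposed))  # False
print(nfc_equal(composed, decomposed))    # True
```

Under exact comparison the two identifiers are simply different, and anything beyond that is left to the scheme or to the separate advice.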

Anyway, these are 3 strawman proposals for how to deal with technical issues raised by the IAB talk at the technical plenary. Fire away.

Larry
-----
http://larry.masinter.net

From: public-iri-request@w3.org [mailto:public-iri-request@w3.org] On Behalf Of Larry Masinter
Sent: Tuesday, November 17, 2009 8:36 AM
To: PUBLIC-IRI@W3.ORG
Cc: Pete Resnick; Ted Hardie
Subject: Slides from IETF

Still waiting for meeting summary & minutes


My slides:
http://www.ietf.org/proceedings/09nov/slides/iri-0.pdf
(no particularly new material)


http://www.ietf.org/proceedings/09nov/slides/plenaryt-1.pdf
was an IAB report on Internationalization in Names and other Identifiers

I think we need to be careful to make sure we’ve considered the issues raised by the IAB.


Personally, I thought the BOF went well: there was clear general agreement that there was work to be done, and many people agreed to participate, but we have to update the charter in response to the discussion.


Larry
--
http://larry.masinter.net
Received on Tuesday, 17 November 2009 21:53:11 UTC