
RE: Using Punicode for host names in IRI -> URI translation; phishing; comparison

From: Shawn Steele <Shawn.Steele@microsoft.com>
Date: Tue, 17 Nov 2009 19:39:16 +0000
To: Larry Masinter <masinter@adobe.com>, "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>, Peter Constable <petercon@microsoft.com>
CC: Pete Resnick <presnick@qualcomm.com>, Ted Hardie <ted.ietf@gmail.com>
Message-ID: <E14011F8737B524BB564B05FF748464A04452DC3@TK5EX14MBXC139.redmond.corp.microsoft.com>
1)  I don't think it's practical to say that URIs must be mapped to Punycode.  People will use Unicode anyway.  That's pretty much been proven with IDN.

2) I think that A & B are reasonable approaches to spoofing concerns.  I'm not sure I agree so much about unambiguously-enterable though.  In most cases someone will probably type it in a form that's expected, perhaps if some mappings are applied, but it is probably unavoidable that some Latin-script user will see a Cyrillic URL and mistakenly try to enter it in Latin.  I can't see how a specification can solve that problem.

3) The statement that Unicode can't be compared like ASCII because ASCII has a canonical form, but Unicode has various mappings, is misleading.  The traditional "single canonical form" for plain-ASCII mappings doesn't work for everyone (e.g., Turkish i).  However, "ASCII lower case" is assumed to work, whether appropriate or not.  So extending a standardized mapping to Unicode wouldn't be "any worse" than the ASCII behavior we have now.  It's just more obvious because there are more code points and we're "used" to ASCII, so we ignore its limitations.
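
To illustrate the Turkish i point, here is a minimal Python sketch. The helper name `ascii_lower` is hypothetical and only stands in for the "ASCII lower case" mapping that protocols commonly assume; it is not taken from any specification.

```python
def ascii_lower(s: str) -> str:
    """Lower-case only A-Z; every other code point is left untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


DOTLESS_I = "\u0131"     # LATIN SMALL LETTER DOTLESS I
DOTTED_CAP_I = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE

# The assumed mapping works for plain ASCII host names.
print(ascii_lower("EXAMPLE.COM"))

# "I" always maps to dotted "i", although a Turkish writer may have meant
# the capital form of dotless i.
print(ascii_lower("I") == "i")

# Python's default (locale-independent) case mappings do not round-trip
# dotless i: it upper-cases to "I", which lower-cases to dotted "i".
print(DOTLESS_I.upper() == "I")
print(DOTLESS_I.upper().lower() == DOTLESS_I)

# Non-ASCII letters are simply ignored by the ASCII-only mapping.
print(ascii_lower(DOTTED_CAP_I) == DOTTED_CAP_I)
```

The point of the sketch is only that the ASCII rule already makes a choice that is wrong for some users; it does not propose a particular Unicode mapping.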

Exact comparisons might be best, but a BCP or UTR that describes good practices might be helpful.  If exact comparisons are required, it should be enforced in ASCII as well.  (My logic is that if it isn't "good enough" for ASCII users, then Unicode users probably would also have problems).


From: public-iri-request@w3.org [public-iri-request@w3.org] on behalf of Larry Masinter [masinter@adobe.com]
Sent: Tuesday, November 17, 2009 10:56 AM
Cc: Pete Resnick; Ted Hardie
Subject: Using Punicode for host names in IRI -> URI translation; phishing; comparison

These are “strawman” proposals in response to the IAB talk at the IETF meeting last week: knock down if you can.

  1.  Punycode for host names vs. non-public-DNS resolution
A number of the concerns about using Punycode for domain names when doing IRI -> URI translation seem to have come from the fact that there are widely deployed private networks which don’t use Punycode at all, but rather send UTF-8 directly to DNS, which the DNS protocol allows.

However, I think for the use case of URI, we could take the position that URIs are really intended to be “UNIFORM” resource identifiers, whose primary use is for communication over the world-wide web, that this use case should predominate, and that, for that reason, IRI -> URI translation MUST use Punycode for host names.  We should then note that private environments with additional mappings may need to deploy software that

  *   uses the IRI form directly (i.e., don’t translate IRI -> URI first)
  *   translates Punycode host names back into UTF-8 for sending to locally specified host name mappings (i.e., undoes the IRI -> URI translation)
  *   provides alternative registration or lookup services for the Punycode version of host names
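
As a rough sketch of the host-name step discussed above, the following Python uses the standard library's `idna` codec, which implements the older IDNA 2003 rules. The function names are hypothetical, and `bücher.example` is only an illustrative host name.

```python
def host_to_ascii(host: str) -> str:
    """IRI -> URI direction: Unicode host name to its Punycode (xn--) form."""
    return host.encode("idna").decode("ascii")


def host_to_unicode(host: str) -> str:
    """Reverse direction: Punycode host name back to Unicode."""
    return host.encode("ascii").decode("idna")


def host_to_utf8(host: str) -> bytes:
    """What a private resolver that accepts raw UTF-8 labels would be sent."""
    return host.encode("utf-8")


unicode_host = "b\u00fccher.example"  # "bücher.example"
ascii_host = host_to_ascii(unicode_host)

print(ascii_host)                   # Punycode form used in the URI
print(host_to_unicode(ascii_host))  # undoing the translation
print(host_to_utf8(unicode_host))   # bytes for a UTF-8 based private network
```

A private environment following the second bullet would apply something like `host_to_unicode` followed by `host_to_utf8` before doing its local lookup.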


  2.  Spoofing

Secondly, there are a number of concerns raised about spoofing. Of course, spoofing is an issue with just ASCII too, example.com vs example.corn being difficult to distinguish (never mind example.C0M).

The observation is that there are many ways in which names can be formed for which there is NO visible distinction between what are separate Unicode encodings.
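
A small Python sketch of that observation, using two illustrative cases chosen here (they are not from the IAB presentation): a Latin/Cyrillic look-alike, and a composed versus decomposed accent. The helper `code_points` is a hypothetical name.

```python
import unicodedata


def code_points(s: str) -> list[str]:
    """List the code points of a string as U+XXXX labels."""
    return [f"U+{ord(c):04X}" for c in s]


latin = "example"
mixed = "ex\u0430mple"       # third letter is CYRILLIC SMALL LETTER A

composed = "caf\u00e9"       # e with acute as a single code point
decomposed = "cafe\u0301"    # "e" followed by COMBINING ACUTE ACCENT

# Both pairs usually render identically, yet the strings differ.
print(latin == mixed, code_points(mixed)[2], unicodedata.name(mixed[2]))
print(composed == decomposed, code_points(decomposed)[-2:])

# Normalization (NFC) reconciles the accent case ...
print(unicodedata.normalize("NFC", decomposed) == composed)
# ... but does nothing for the cross-script look-alike.
print(unicodedata.normalize("NFC", mixed) == latin)
```

Normalization therefore helps with some equivalent encodings, but it does not address look-alike characters from different scripts.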

The main way I think of addressing these are:

  A.  Visual validation of URIs and IRIs is basically *NOT EFFECTIVE*, and user agents *SHOULD NOT USE* visual validation as the primary way of preventing spoofing. Other methods for protecting against phishing *MUST* be used.   I think we can point to some of the techniques that browsers currently already deploy as alternatives, without making them normative.
  B.  Anyone who prints an IRI on the side of a bus or a matchbook cover has the responsibility of making sure that what they print can be typed in a way that leads to an unambiguous result.  Currently this advice only applies to ASCII-only URIs, and the extension to other non-ASCII URIs depends on infrastructure that is NOT currently part of, or mandated by, or appropriate for, the IRI specification.   Unfortunately, the implementation advice on how to generate an unambiguously-enterable IRI depends on technology deployment which HAS NOT YET HAPPENED, and nothing we can specify in the IETF will make it happen sooner.  We can give some advice that will mitigate a few of the problems, but so few that making that advice normative isn’t actually helpful.

  3.  Comparison

(Related to, but different from, spoofing.) There are a number of examples in the IAB presentation of cases where comparison of ASCII-only identifiers can’t readily be extended to comparison of Unicode-extended identifiers, because of the multiple representations and lack of a single canonical form (such as with ASCII, where case-insensitive comparison => lower case canonical form).

I think with the case of URIs and IRIs that the only comparison that should be normative is “exact string compare”, and that any other comparison, normalization or other advice should be scheme specific. We could take the entire comparison section of the IRI document and separate it out into its own BCP.
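
To make the distinction concrete, here is a hedged Python sketch contrasting exact string comparison with one possible scheme-specific comparison. Both function names are hypothetical, and the scheme-specific rule (case-insensitive scheme and host, everything else exact) is only an assumed example of what such a rule might look like, not something the IRI document defines.

```python
from urllib.parse import urlsplit


def exact_equal(a: str, b: str) -> bool:
    """The proposed normative rule: code point for code point."""
    return a == b


def example_scheme_equal(a: str, b: str) -> bool:
    """An assumed scheme-specific rule: ignore case in scheme and host only."""
    def key(u: str):
        p = urlsplit(u)
        # urlsplit lower-cases .scheme and .hostname; other parts keep case.
        return (p.scheme, p.username, p.password, p.hostname, p.port,
                p.path, p.query, p.fragment)
    return key(a) == key(b)


a = "HTTP://Example.COM/Path"
b = "http://example.com/Path"
c = "http://example.com/path"

print(exact_equal(a, b))           # differ as exact strings
print(example_scheme_equal(a, b))  # same under the assumed scheme rule
print(example_scheme_equal(b, c))  # path case still matters
```

Keeping the second function outside the normative text is the point of the proposal: each scheme could publish its own rule without changing what "equal" means for IRIs in general.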

Anyway, these are 3 strawman proposals for how to deal with technical issues raised by the IAB talk at the technical plenary. Fire away.


From: public-iri-request@w3.org [mailto:public-iri-request@w3.org] On Behalf Of Larry Masinter
Sent: Tuesday, November 17, 2009 8:36 AM
Cc: Pete Resnick; Ted Hardie
Subject: Slides from IETF

Still waiting for meeting summary & minutes

My slides:
(no particularly new material)

was an IAB report on Internationalization in Names and other Identifiers

I think we need to be careful to make sure we’ve considered the issues raised by the IAB.

Personally, I thought the BOF went well and that there was clear general agreement that there was work to be done, many people agreed to participate, but that we had to update the charter in response to the discussion.

Received on Tuesday, 17 November 2009 19:40:01 UTC
