- From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Date: Wed, 18 Nov 2009 20:02:21 +0900
- To: Larry Masinter <masinter@adobe.com>
- CC: "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>, Pete Resnick <presnick@qualcomm.com>, Ted Hardie <ted.ietf@gmail.com>
Hello Larry, others,

Many thanks for picking up these topics. I'll try to answer separately on each of the issues to hopefully simplify the discussion. I'll try to follow up on the "phishing" and "comparison" topics tomorrow (may become Friday).

On 2009/11/18 3:56, Larry Masinter wrote:
> These are "strawman" proposals in response to the IAB talk at the IETF meeting last week: knock down if you can.
> ----
>
> 1. Punicode for host names vs. non-public-DNS resolution
>
> A number of the concerns in the about using punicode for domain names when doing IRI -> URI translation seem to have come from the fact that there are widely deployed private networks which don't use punicode at all, but rather send UTF8 directly to DNS, which the DNS protocol allows.

Yes.

> However, I think for the use case of URI, we could take the position that URIs are really intended to be "UNIFORM" resource identifiers, whose primary use is for communication over the world-wide web, and that that use case should predominate, and that, for that reason, IRI -> URI translation MUST use punicode for host names.

Why MUST? I understand the goal is to not send %-encoding through the DNS, which is indeed a desirable goal, but URIs and the DNS aren't exactly the same, and sending some %-characters to the DNS won't create many false positives, so I don't understand why you give this goal such a high, virtually absolute, priority. And what should browsers and other software do with URIs that already contain %-encoding in their domain-name part?
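
To make the two host encodings discussed here concrete, the following is a minimal sketch in Python. It assumes the standard library's `idna` codec, which implements the older IDNA 2003 rules, and the helper names `host_to_punycode`, `host_to_percent`, and `normalize_host` as well as the sample host `bücher.example` are hypothetical. The last helper shows only one possible way of handling a host that already arrives %-encoded; it is not something RFC 3987 prescribes.

```python
from urllib.parse import quote, unquote


def host_to_punycode(host: str) -> str:
    """Return the ASCII ("xn--") form of a Unicode host name.

    Uses Python's built-in "idna" codec (IDNA 2003), an assumption made
    for this sketch; other IDNA versions may map some names differently.
    """
    return host.encode("idna").decode("ascii")


def host_to_percent(host: str) -> str:
    """Return the host with non-ASCII characters as UTF-8-based %HH escapes."""
    return quote(host, safe="")


def normalize_host(host: str) -> str:
    """One possible handling of an already %-encoded host:
    undo the UTF-8-based %-escaping, then apply punycode."""
    return host_to_punycode(unquote(host))


if __name__ == "__main__":
    name = "bücher.example"
    print(host_to_punycode(name))
    print(host_to_percent(name))
    print(normalize_host(host_to_percent(name)))
```

With a helper like `normalize_host`, a %-encoded host and its punycode form end up as the same DNS query, so software would not have to reject one form in favour of the other.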
> We should then note that private environments with additional mappings may need to deploy software that
>
> * uses the IRI form directly (i.e., don't translate IRI -> URI first)
> * translates punicode host names back into UTF8 for sending to locally specified host name mappings (i.e., undo the IRI->URI translation)
> * provide alternative registration or lookup services for punicode version of host names

This looks like the easiest path forward in the short term. And RFC 3987 already allowed using punycode when converting from an IRI to a URI. (I agree that the wording there wasn't optimal, and can and should be improved.)

But because domain names in percent-encoded form will turn up anyway in various places (e.g. in query parts of URIs, ...), it doesn't make sense to disallow them in one location in a URI with a "MUST punycode".

It would be much better to look at the long-term, well-layered structure of the whole thing. For me, the answer to the question on slide 23 of http://www.ietf.org/proceedings/09nov/slides/plenaryt-1.pdf

>>>>
How Many Layers of Encoding?
• How do we encode:
  A domain name… in an email address… in a “mailto” URL… in a web page?
• Do we use:
  – Punycode (“xn‐‐…”) encoding for the domain name?
  – Email Quoted‐Printable (“=XX”) encoding?
  – URL percent (“%XX”) escaping?
  – HTML ampersand (“&#xxxx;”) codes?
• All of the above?
>>>>

is that there are two rules:

1) When you can, use the character directly
2) When you can't, use the escaping convention of the current format

For the above question, it would mean that because we are in a Web page, we use the character directly, or use &#xHHHH; if the character cannot be represented in the encoding of the Web page.

If we remove the lowest part of the question and talk about

• How do we encode:
  A domain name… in an email address… in a “mailto” URL?

then we would use URI escaping conventions, i.e. UTF-8-based %HH, of course only if we don't use an IRI, in which case again the characters could be used directly.

If we remove another layer, we get:

• How do we encode:
  A domain name… in an email address?

I don't understand escaping conventions in email addresses per se very well, and I guess they may not be applicable because they don't deal with escaping arbitrary Unicode characters, but if we are speaking about an EAI email address, we wouldn't need any escaping at all.

If we remove another layer, we get:

• How do we encode:
  A domain name?

Here we have to choose punycode or UTF-8 depending on the circumstances. It is clear that on the public Internet, in DNS request and response packets, we have to use punycode. But if somebody designed an API to handle the issues discussed in draft-iab-idn-encoding-01.txt, I very much hope s/he would design it with UTF-8 (or whatever other encoding of Unicode is appropriate for the platform, e.g. UTF-16 for Java, ...) at the core, potentially with the option to detect (and if necessary "backfix") punycode and UTF-8-based %-encoding, rather than with punycode at the core. Architecturally, the reasons for this should be very clear.

What I think we have to do for the IRI spec is to try and make things go in the right direction architecturally without forbidding short-term shortcuts.

In terms of browser implementations, we are already half-way there (in all the cases below, I tested putting an IRI into the address field; all tests were done on Windows Vista):

- Opera, Amaya, and Google Chrome support %-escaping for non-ASCII characters in domain names.

- Mozilla (3.0.15) doesn't support it directly, but converts the %-escaping back to Unicode characters in the address field. This means that you have to go back again to the address field and hit return to find the page. This seems rather accidental, and probably the best thing is to fix it, but I'd be glad to hear from somebody involved in Mozilla development.

- Safari (4.0.3 (531.9.1)) is even more confusing: It converts the address field from percent-escaping to punycode but then claims that the punycode can't be reached (although I verified that it got the right punycode). It puts the readable (IRI) version of the domain name into a Google search box on the page. Again, if you go back to the address field and hit return, it finds the page, and converts the address field to an IRI. Again, this seems quite accidental, and probably the best thing is to fix it so that it works directly.

- IE 7 is the only browser that is straightforward: percent-escaping stays as percent-escaping, and the site isn't reached.

Regards,    Martin.

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp

Received on Wednesday, 18 November 2009 11:03:24 UTC