- From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Date: Wed, 26 Aug 2009 18:29:57 +0900
- To: dthaler@microsoft.com
- CC: IAB <iab@ietf.org>, "idna-update@alvestrand.no" <idna-update@alvestrand.no>, "public-iri@w3.org" <public-iri@w3.org>
Below are my comments on draft-iab-idn-encoding-00.txt (http://tools.ietf.org/html/draft-iab-idn-encoding-00). I have cc'ed the IAB list (because this is an IAB document), the list of the idnabis WG as the most directly affected WG, and the public-iri list, where discussions about the IRI specification are held. Everybody, please reduce cross-posting for contributions that are not of interest to all three lists.

In general, I think the document is easy to read and understand.

Mentioning ISO-2022-JP for encoding Japanese domain names raises some suspicion. ISO-2022-JP may well be (or have been) used in the DNS or a similar system, but such use would be atypical and should be documented with a reference. Based on the general "division of labor" among the three classical Japanese encodings (ISO-2022-JP, EUC-JP, Shift_JIS), one would expect EUC-JP or Shift_JIS rather than ISO-2022-JP in such a case. [Among the three, ISO-2022-JP makes it easiest to explain the "heuristic encoding detection" scenario described at the end of Section 1.1. But without a reference, it may look to some as if ISO-2022-JP were a made-up example.]

For the bulleted list at the end of Section 1.1, it should be pointed out that UTF-8 can be detected, and distinguished from other 8-bit encodings, with much higher precision than just checking whether "a byte in the string has the 8th bit set". For details, please see http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf.

The heuristic given for punycode in Section 1.1 is "starts with xn--". However, at the level of getaddrinfo we are dealing with whole domain names, not single labels, and something like www.xn--foo.jp should definitely be treated as containing punycode even though it does not start with xn-- (a rough sketch of such a per-label check is appended below as a P.S.).

The solution that the document seems to be pushing most is heuristic detection, i.e. an API where strings in different encodings are fed in and the API sorts things out heuristically, converting if necessary. To some extent this may be an unavoidable evil, but it would be good if the document pushed more strongly for explicit encoding identification, for which I think GetAddrInfoW() (UTF-16) would be an example.

It may be a good idea to also look into the issue of escaped forms of domain names being fed into resolver APIs. One form of escaping is (UTF-8-based) %-encoding in URIs (and IRIs): it is allowed in URIs according to RFC 3986, it is the only way to encode non-ASCII characters in the host part of a URI where punycode isn't appropriate, and it may be the result of a conversion from an IRI to a URI. For further background and discussion, please see http://lists.w3.org/Archives/Public/public-iri/2009Aug/0012.html and http://lists.w3.org/Archives/Public/public-iri/2009Aug/0024.html and the followup discussion.

Another potential kind of escaping is HTML/XML numeric character references (of the form &#xABCD;), although I expect them to be less of a problem because they are used higher up in the application and are usually removed early on.

Regards,    Martin.

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp
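
P.S.: To make the per-label point above concrete, here is a rough, untested sketch in C of a check that looks for the ACE prefix "xn--" in every label rather than only at the start of the whole name (the function name is my own invention, not something taken from the draft or from any existing API):

    #include <stdbool.h>
    #include <string.h>
    #include <strings.h>

    /* Return true if any label of an ASCII domain name starts with the
     * ACE prefix "xn--".  Checking only the beginning of the whole name
     * would miss names such as "www.xn--foo.jp". */
    static bool name_contains_ace_label(const char *name)
    {
        const char *label = name;

        while (label != NULL && *label != '\0') {
            if (strncasecmp(label, "xn--", 4) == 0)
                return true;
            label = strchr(label, '.');
            if (label != NULL)
                label++;        /* step past the dot to the next label */
        }
        return false;
    }

With such a check, name_contains_ace_label("www.xn--foo.jp") returns true, while name_contains_ace_label("www.example.jp") returns false.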