Comments on draft-iab-idn-encoding-00.txt from Martin J. Dürst on 2009-08-26 (public-iri@w3.org from August 2009)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Wed, 26 Aug 2009 18:29:57 +0900
To: dthaler@microsoft.com
CC: IAB <iab@ietf.org>, "idna-update@alvestrand.no" <idna-update@alvestrand.no>, "public-iri@w3.org" <public-iri@w3.org>
Message-ID: <4A950095.1070806@it.aoyama.ac.jp>

Below are my comments on draft-iab-idn-encoding-00.txt
(http://tools.ietf.org/html/draft-iab-idn-encoding-00).

I have cc'ed the IAB list (because this is an IAB document), the list of 
the idnabis WG as the most directly affected WG, and the public-iri 
list, where discussions about the IRI specification are held. Everybody, 
please reduce cross-posting for contributions that are not of interest 
to all three lists.

In general, I think the document is easy to read and understand.

Mentioning ISO-2022-JP for encoding Japanese domain names raises some 
suspicion. ISO-2022-JP may well be (or have been) used in the DNS or a 
similar system, but such use would be atypical, and should be documented 
by a reference. Based on the general "division of labor" of the three 
classical Japanese encodings (ISO-2022-JP, EUC-JP, Shift_JIS), one would 
expect EUC-JP or Shift_JIS rather than ISO-2022-JP in such a case. 
[Among the three, ISO-2022-JP makes it easiest to explain the "heuristic 
encoding detection" scenario described at the end of Section 1.1. But 
without a reference, it may look to some as if ISO-2022-JP was a made-up 
example.]

For the bulleted list at the end of Section 1.1, it should be pointed 
out that UTF-8 can be detected, and distinguished from other 8-bit 
encodings, with much higher precision than just "a byte in the string 
has the 8th bit set". For details, please see 
http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf.
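
As a rough illustration of the point above (a minimal Python sketch, not taken from the draft or the cited paper): instead of only checking for a byte with the 8th bit set, a detector can require that the whole byte string is well-formed UTF-8. The function name looks_like_utf8 is hypothetical.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data has non-ASCII bytes AND is well-formed UTF-8.

    Hypothetical helper for illustration only. Pure ASCII input returns
    False because it carries no evidence for or against UTF-8.
    """
    if all(b < 0x80 for b in data):
        return False
    try:
        # Strict decoding rejects stray lead bytes, stray continuation
        # bytes, overlong forms, and encoded surrogates.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    name = "日本語.jp"
    for codec in ("utf-8", "euc_jp", "shift_jis", "iso2022_jp"):
        print(codec, looks_like_utf8(name.encode(codec)))
```

Text in legacy 8-bit encodings such as EUC-JP or Shift_JIS very rarely happens to form valid UTF-8 byte sequences, which is why this check is much more selective than the 8th-bit test alone.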

The heuristic for punycode that is given in Section 1.1 is "starts with 
xn--". However, on the level of getaddrinfo, we are dealing with domain 
names, not single labels, and something like www.xn--foo.jp should 
definitely be treated as containing punycode even though the name as a 
whole doesn't start with xn--.
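
To make the label-versus-name distinction concrete, here is a minimal Python sketch that applies the "xn--" test to each label rather than to the whole name. The names ACE_PREFIX and has_ace_label are hypothetical, and the sketch only checks the prefix; it does not verify that the rest of the label is valid punycode.

```python
ACE_PREFIX = "xn--"  # prefix that marks a punycode-encoded (ACE) label


def has_ace_label(domain: str) -> bool:
    """Return True if any dot-separated label starts with the ACE prefix.

    Hypothetical helper for illustration only. The comparison is
    case-insensitive, so "XN--" is accepted as well.
    """
    return any(
        label.lower().startswith(ACE_PREFIX)
        for label in domain.split(".")
    )


if __name__ == "__main__":
    name = "www.xn--foo.jp"
    print(name.startswith(ACE_PREFIX))  # whole-name test
    print(has_ace_label(name))          # per-label test
```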

The solution that the document seems to be pushing most is heuristic 
detection, i.e. an API where strings in different encodings are fed in 
and the API sorts things out heuristically, converting if necessary. To 
some extent, this may be an unavoidable evil, but it would be good if 
the document were pushing more for clear encoding identification (for 
which I think GetAddrInfoW() (UTF-16) would be an example).

It may be a good idea to also look into the issue of escaped forms of 
domain names being fed into resolver APIs. One form of escaping is 
(UTF-8-based) %-encoding in URIs (and IRIs). It is allowed in URIs 
according to RFC 3986, is the only way to encode non-ASCII in the host 
part of a URI where punycode isn't appropriate, and may be the result 
of a conversion from an IRI to a URI. For further background and 
discussion, please see
http://lists.w3.org/Archives/Public/public-iri/2009Aug/0012.html
and http://lists.w3.org/Archives/Public/public-iri/2009Aug/0024.html and 
the followup discussion.
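
For illustration, a minimal Python sketch of undoing UTF-8-based %-encoding in a host name before it is handed on to a resolver. The function name unescape_host and the example host are hypothetical; urllib.parse.unquote is the standard-library call, used here with strict error handling so that escapes which are not valid UTF-8 are rejected rather than silently replaced.

```python
from urllib.parse import unquote


def unescape_host(host: str) -> str:
    """Decode %-escapes in a host name, interpreting them as UTF-8.

    Hypothetical helper for illustration only. Raises UnicodeDecodeError
    if the escaped bytes are not well-formed UTF-8.
    """
    return unquote(host, encoding="utf-8", errors="strict")


if __name__ == "__main__":
    print(unescape_host("d%C3%BCrst.example"))
```

What happens after unescaping (for example, conversion to punycode) is a separate step and is not shown here.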

Another potential kind of escaping is HTML/XML numeric character 
references (of the form &#xABCD;), although I expect them to be less of 
a problem because they are used higher up in the application and usually 
removed early on.
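
As a small sketch of what "removed early on" can look like in an application, the following Python snippet resolves numeric character references with the standard-library html.unescape before the name gets anywhere near a resolver. The function name strip_char_refs and the example host are hypothetical.

```python
import html


def strip_char_refs(host: str) -> str:
    """Replace HTML/XML character references with the characters they name.

    Hypothetical helper for illustration only; it relies on
    html.unescape, which handles references such as &#xABCD; and &#1234;.
    """
    return html.unescape(host)


if __name__ == "__main__":
    print(strip_char_refs("&#x65E5;&#x672C;.jp"))
```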

Regards,     Martin.
-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp

Received on Wednesday, 26 August 2009 09:31:10 UTC