- From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Date: Sat, 12 Sep 2009 12:14:12 +0900
- To: John C Klensin <klensin@jck.com>
- CC: Erik van der Poel <erikv@google.com>, Andrew Sullivan <ajs@shinkuro.com>, idna-update@alvestrand.no, dthaler@microsoft.com, "public-iri@w3.org" <public-iri@w3.org>
Hello John,

[Dave, this is Cc'ed to you because of some discussion relating to draft-iab-idn-encoding-00.txt.]

[I'm also cc'ing public-iri@w3.org because of the IRI-related issue at the end.]

[Everybody, please remove the Cc fields when they are unnecessary.]

Overall, I'm afraid that on this issue, more convoluted explanations won't convince me or anybody else, but I'll nevertheless try to answer your discussion below point by point. What I (and, I guess, others on this list) would really like to know is whether you have any CONCRETE reports or evidence of problems with IDN labels that are longer than 63 octets when expressed in UTF-8. Otherwise, Michel has put it much better than I could: "given the lack of issues with IDNA2003 on that specific topic there are no reasons to introduce an incompatible change".

On 2009/09/12 0:47, John C Klensin wrote:
>
> --On Friday, September 11, 2009 17:37 +0900 "Martin J. Dürst"
> <duerst@it.aoyama.ac.jp> wrote:
>
>>> (John claimed that the email context required such a
>>> rule, but I did not bother to confirm that.)
>
>> Given dinosaur implementations such as sendmail, I can
>> understand the concern that some SMTP implementations may not
>> easily be upgradable to use domain names with more than 255
>> octets or labels with more than 63 octets. In that case, I
>> would have expected at least a security warning at
>> http://tools.ietf.org/html/rfc4952#section-9 (EAI is currently
>> written in terms of IDNA2003, and so there are no length
>> restrictions on U-labels).
>
> I obviously have not been explaining this very well. The
> problem is not "dinosaur implementations"

Okay, good.

> but a combination of
> two things (which interact):
>
> (1) Late resolution of strings, possibly through APIs that
> resolve names in places that may not be the public DNS.
> Systems using those APIs may keep strings in UTF-8 until very
> late in the process, even passing the UTF-8 strings into the
> interface or converting them to ACE form just before calling the
> interface. Either way, because other systems have come to rely
> on the 63 octet limit, strings longer than 63 characters pose a
> risk of unexpected problems. The issues with this are better
> explained in draft-iab-idn-encoding-00.txt, which I would
> strongly encourage people in this WG to go read.

I have indeed read draft-iab-idn-encoding-00.txt (I sent comments to the author and the IAB and copied this list). That document mentions the length restrictions as essentially the only restrictions in the DNS itself, rather than in things on top of it.

That document also (well, mainly) discusses the issue of names being handed down into APIs in various forms (UTF-8, UTF-16, punycode, legacy encodings, ...), and being resolved by various mechanisms (DNS, NetBIOS, mDNS, hosts file, ...), and the problem that these mechanisms may use and expect different encodings for non-ASCII characters. However, I haven't found any mention, nor even a hint, in that document, of a need to restrict punycode labels to less than 63 octets when expressed in UTF-8.

The document mentions (as something that might happen, but shouldn't) that an application may pass a UTF-8 string to something like getaddrinfo, and that string may be passed directly to the DNS. First, if this happens, IDNA has already lost. Second, whether the string is UTF-8 or pure ASCII, if the API isn't prepared to handle labels longer than 63 octets and overall names longer than 255 octets defensively (i.e., return something like 'not found'), then the programmer should be fired. Anyway, in that case, the problem isn't with UTF-8.
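To make the defensive behavior concrete, here is a minimal sketch of the length check an API could apply before resolution (my illustration, not from the thread; the function name is hypothetical, and it counts the octets the resolver would actually see, here UTF-8):

```python
def within_dns_limits(name: str) -> bool:
    """Return True only if every label is 1..63 octets and the
    whole name is at most 255 octets (hypothetical helper)."""
    octets = name.rstrip(".").encode("utf-8")
    if len(octets) > 255:
        return False
    # every label must be non-empty and at most 63 octets
    return all(0 < len(label) <= 63 for label in octets.split(b"."))
```

A resolver wrapper could simply answer 'not found' whenever this check fails, instead of passing an over-long name further down.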
What draft-iab-idn-encoding-00.txt essentially points out is that different name resolution services use different encodings for non-ASCII characters, and that currently different users (meaning applications) of a name resolution API may assume different encodings for non-ASCII characters, which creates all kinds of chances for errors. Some heuristics may help in some cases, but the right solution (as in all cases where characters, and in particular non-ASCII ones, are involved) is to say clearly where which encoding is used. A very simple example of this is GetAddrInfoW, which assumes UTF-16.

The only potential problem that I see in the discussion in draft-iab-idn-encoding-00.txt is the following: some labels containing non-ASCII characters that fit into 63 octets in punycode, and therefore can be resolved with the DNS, may not be resolvable with some other resolution service, because that service may use a different encoding (and may or may not have different length limits).

I have absolutely nothing against some text in a Security Considerations section or in Rationale pointing out that if you want to set up a name or label for resolution via multiple different resolution services, you have to take care to choose your names and labels so that they meet the length restrictions of all those services. But that doesn't imply at all that we have to artificially restrict the length of punycode labels by counting octets in UTF-8.

> (2) The "conversion of DNS name formats" issue that has been
> extensively discussed as part of the question of alternate label
> separators (sometimes described in our discussions as
> "dot-oids"). Applications that use domain names, including
> domain names that are not going to be resolved (or even looked
> up), must be able to freely and accurately convert between
> DNS-external (dot-separated labels) and DNS-internal
> (length-string pairs) formats _without_ knowing whether they are
> IDNs or not.
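The divergence between the two ways of counting can be shown with Python's built-in codecs (a quick illustration of mine, not part of the original discussion): a label can stay comfortably under 63 octets as an A-label while its UTF-8 form is well over 63 octets.

```python
# 25 identical CJK characters: 75 octets in UTF-8, but punycode
# compresses the repetition, so the A-label stays under the DNS limit.
label = "中" * 25

utf8_len = len(label.encode("utf-8"))           # 3 octets per character
a_label = label.encode("idna").decode("ascii")  # "xn--" + punycode (IDNA2003 ToASCII)

print(utf8_len)      # 75 -- over 63 if you count UTF-8 octets
print(len(a_label))  # well under 63 -- fine as an A-label
```

This is exactly the kind of label that a UTF-8-based octet count would forbid even though it resolves without any problem through the DNS.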
I'm not exactly sure what you mean here. If you mean "without checking whether they contain xn-- prefixes and punycode or not", then I can agree, but that cannot motivate a UTF-8-based length restriction. If you mean that applications, rather than first converting U-label -> A-label and then converting from dot-separated to length-string notation, must be able to first convert to length-string notation and then convert U-labels to A-labels, then I contend that nobody in their right mind would do it that way, and even less so if "dot-oids" are involved. For starters, U-labels don't have a fixed encoding.

> As discussed earlier, one of several reasons for
> that requirement is that, in non-IDNA-aware contexts, labels in
> non-IDNA-aware applications or contexts may be perfectly valid
> as far as the DNS is concerned, because the only restriction the
> DNS (and the normal label type) imposes is "octets".

If and where somebody has binary labels, of course those binary labels must not be longer than 63 octets. But IDNA doesn't use binary labels, and doesn't stuff UTF-8 into DNS protocol slots, so for IDNA, any length restriction on UTF-8 is irrelevant.

> That
> length-string format has a hard limit of 63 characters that can
> be exceeded only if one can figure out how to get a larger
> number into six bits (see RFC 1035, first paragraph of Section
> 3.1, and elsewhere).

I know very well that the 63-octet (not character) limit is a hard one. In the long run, one might imagine an extension to DNS that uses another label format without this limitation, but there is no need at all to go there for this discussion.

> If we permit longer U-label strings on the
> theory that the only important restriction is on A-labels, we
> introduce new error states into the format conversion process.

For IDNA, only A-labels get sent through the DNS protocol, so only there is the length restriction on labels relevant.
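The length-string (wire) format under discussion is easy to sketch, and the 63-octet cap falls directly out of it: RFC 1035 reserves the top two bits of the length octet, leaving six bits for the label length. A minimal illustration (hypothetical helper name, not from the thread):

```python
def to_wire_format(name: str) -> bytes:
    """Encode a dot-separated domain name as DNS length-string pairs
    (RFC 1035, Section 3.1). Only 6 bits are available for each label
    length, hence the hard 63-octet limit per label."""
    wire = bytearray()
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")  # A-labels / LDH labels are ASCII
        if not 0 < len(raw) <= 63:
            raise ValueError("label must be 1..63 octets")
        wire.append(len(raw))        # one length octet per label
        wire += raw
    wire.append(0)                   # zero-length root label ends the name
    return bytes(wire)
```

For example, `to_wire_format("example.com")` yields `b"\x07example\x03com\x00"`; converting in the sensible order (U-label -> A-label first, then to wire format) never produces a label that overflows the length octet.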
If somebody gets this wrong in the format conversion process (we currently don't have any reports of that), then that's their problem (and we can point it out in a Security section or so).

> If this needs more explanation somewhere (possibly in
> Rationale), I'm happy to try to do that. But I think
> eliminating the restriction would cause far more problems than
> it is worth.

It hasn't caused ANY problems in IDNA2003. There is nothing new in IDNA2008 that would motivate a change. *Running code*, one of the guidelines of the IETF, shows that the restriction is unnecessary.

> I note that, while I haven't had time to respond, some of the
> discussion on the IRI list has included an argument that domain
> names in URIs cannot be restricted to A-label forms but must
> include %-escaped UTF-8 simply because those strings might not
> be public-DNS domain names but references to some other database
> or DNS environment.

It's not 'simply because'. It's first and foremost because of the syntactic uniformity of URIs, and because it's impossible to identify all domain names in a URI: the usual slot after the '//' is easy, and scheme-specific processing (which is not what URIs and IRIs are about) may be able to deal with some cases such as 'mailto', but what do you do about domain names in query parts? Also, this syntax is part of RFC 3986, STD 66, a full IETF Standard.

Overall, it's just a question of which escaping convention should be used. URIs have their specific escaping convention (%-encoding), and the DNS has its specific escaping convention (punycode). Also, please note that the IRI spec doesn't prohibit using punycode when converting to URIs. In addition, please note that at least my personal implementation experience (adding IDN support to Amaya) shows that the overhead of supporting %-encoding in domain names in URIs is minimal, and that it helps streamline the implementation.
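The two escaping conventions can be contrasted directly (my illustration with a stock example host, not code from the thread): the URI convention %-escapes the UTF-8 octets, while the DNS convention maps each label to an ACE/punycode A-label.

```python
from urllib.parse import quote

host = "bücher.example"  # example IDN host, not from the thread

# URI escaping convention: %-encode the UTF-8 octets
pct_form = quote(host, safe=".")

# DNS escaping convention: ACE (punycode) via the IDNA2003 codec
ace_form = host.encode("idna").decode("ascii")

print(pct_form)  # b%C3%BCcher.example
print(ace_form)  # xn--bcher-kva.example
```

Both forms denote the same name; which one is appropriate depends purely on which protocol slot the name is going into.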
> It seems to me that one cannot have it
> both ways -- either the application knows whether a string is a
> public DNS reference that must conform _only_ to IDNA
> requirements (but then can be restricted to A-labels) or the
> application does not know and therefore must conform to DNS
> requirements for label lengths.

There is absolutely no need to restrict *all* references just because *some of them* may use other resolver systems with other length restrictions (which may be "63 octets per label when measured in UTF-8" or something completely different). It would be very similar to saying "some compilers/linkers can only deal with identifiers 6 characters or shorter, so all longer identifiers are prohibited."

> For our purposes, the only
> sensible way, at least IMO, to deal with this is to require
> conformance to both sets of rules, i.e., 63 character maximum
> for A-labels and 63 character maximum for U-labels.

As far as I understand punycode, it's impossible to encode a Unicode character in less than one octet. This means that a maximum of 63 *characters* for U-labels is automatically guaranteed by a maximum of 63 characters/octets for A-labels. However, Defs clearly says "length in octets of the UTF-8 form", so I guess this was just a slip of your fingers.

Regards,    Martin.

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp
Received on Saturday, 12 September 2009 03:15:32 UTC