- From: Dave Thaler <dthaler@microsoft.com>
- Date: Tue, 15 Sep 2009 08:09:05 -0400
- To: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>, "John C Klensin" <klensin@jck.com>
- CC: Erik van der Poel <erikv@google.com>, Andrew Sullivan <ajs@shinkuro.com>, "idna-update@alvestrand.no" <idna-update@alvestrand.no>, "public-iri@w3.org" <public-iri@w3.org>
Martin J. Dürst writes:

> Hello John,
>
> [Dave, this is Cc'ed to you because of some discussion relating to
> draft-iab-idn-encoding-00.txt.]
>
> [I'm also cc'ing public-iri@w3.org because of the IRI-related issue at
> the end.]
>
> [Everybody, please remove the Cc fields when they are unnecessary.]
>
> Overall, I'm afraid that on this issue, more convoluted explanations
> won't convince me nor anybody else, but I'll nevertheless try to answer
> your discussion below point-by-point.
>
> What I (and I guess others on this list) really would like to know is
> whether you have any CONCRETE reports or evidence regarding problems
> with IDN labels that are longer than 63 octets when expressed in UTF-8.
>
> Otherwise, Michel has put it much better than me: "given the lack of
> issues with IDNA2003 on that specific topic there are no reasons to
> introduce an incompatible change".
>
> On 2009/09/12 0:47, John C Klensin wrote:
> >
> > --On Friday, September 11, 2009 17:37 +0900 "Martin J.
> > Dürst" <duerst@it.aoyama.ac.jp> wrote:
> >
> >>> (John claimed that the email context required such a
> >>> rule, but I did not bother to confirm that.)
> >> Given dinosaur implementations such as sendmail, I can
> >> understand the concern that some SMTP implementations may not
> >> easily be upgradable to use domain names with more than 255
> >> octets or labels with more than 63 octets. In that case, I
> >> would have expected at least a security warning at
> >> http://tools.ietf.org/html/rfc4952#section-9 (EAI is currently
> >> written in terms of IDNA2003, and so there are no length
> >> restrictions on U-labels).
> >
> > I obviously have not been explaining this very well. The
> > problem is not "dinosaur implementations"
>
> Okay, good.
>
> > but a combination of
> > two things (which interact):
> >
> > (1) Late resolution of strings, possibly through APIs that
> > resolve names in places that may not be the public DNS.
> > Systems using those APIs may keep strings in UTF-8 until very
> > late in the process, even passing the UTF-8 strings into the
> > interface or converting them to ACE form just before calling the
> > interface. Either way, because other systems have come to rely
> > on the 63 octet limit, strings longer than 63 characters pose a
> > risk of unexpected problems. The issues with this are better
> > explained in draft-iab-idn-encoding-00.txt, which I would
> > strongly encourage people in this WG to go read.

Actually, systems using those APIs, which are the "standard" (with a
lower-case s) APIs, may keep strings in UTF-8 (or even UTF-16 for common
but non-"standard" variants) until very late, and for some protocols
that are defined to use UTF-8, e.g. mDNS, may keep strings in UTF-8
without ever converting them.

> I have indeed read draft-iab-idn-encoding-00.txt (I sent comments to the
> author and the IAB and copied this list). That document mentions the
> length restrictions as essentially the only restrictions in DNS itself,
> rather than in things on top of it. That document also (well, mainly)
> discusses the issue of names being handed down into APIs in various
> forms (UTF-8, UTF-16, punycode, legacy encodings, ...), and being
> resolved by various mechanisms (DNS, NetBIOS, mDNS, hosts file, ...),
> and the problem that these mechanisms may use and expect different
> encodings for non-ASCII characters.
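To make that encoding ambiguity concrete, here is a minimal sketch (in
Python; the label "bücher" is just an illustrative example) of how one
and the same label reaches a resolution API as three different byte
strings depending on which convention the caller assumed:

    import urllib.parse

    label = "bücher"                      # illustrative non-ASCII label

    utf8_form = label.encode("utf-8")     # raw UTF-8, e.g. what an mDNS-style caller might pass
    alabel = label.encode("idna")         # IDNA2003 ToASCII (punycode A-label), as used for DNS
    pct_form = urllib.parse.quote(label)  # %-escaped UTF-8, as it would appear in a URI

    print(utf8_form)   # b'b\xc3\xbccher'
    print(alabel)      # b'xn--bcher-kva'
    print(pct_form)    # b%C3%BCcher

A resolution API handed one of these byte strings cannot in general tell
which convention the caller had in mind, which is exactly the ambiguity
the draft describes.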
> > However, I haven't found any mention, nor even a hint, in that > document, > of a need to restrict punycode labels to less than 63 octets when > expressed in UTF-8. I agree with the above characterization. > The document mentions (as something that might happen, but shouldn't) > that an application may pass a UTF-8 string to something like > getaddrinfo, and that string may be passed directly to the DNS. First, > if this happens, IDNA has already lost. I'm don't agree with the "shouldn't", and certainly it was not the intent of draft-iab-idn-encoding-00.txt to actually state whether this "shouldn't" happen, but that it "can" happen (and perhaps "does"). There's also a potential argument in the doc that this is not harmful (see 2nd paragraph of section 4 for instance, and extrapolate from there). > Second, whether the string is > UTF-8 or pure ASCII, if the API isn't prepared to handle labels longer > than 63 octets and overall names longer than 255 octets defensively > (i.e. return something like 'not found'), then the programmer should be > fired. Anyway, in that case, the problem isn't with UTF-8. > > What draft-iab-idn-encoding-00.txt essentially points out is that > different name resolution services use different encodings for non- > ASCII > characters, and that currently different users (meaning applications) > of > a name resolution API may assume different encodings for non-ASCII > characters, which creates all kinds of chances for errors. Some > heuristics may help in some cases, but the right solution (as with all > cases where characters, and in particular non-ASCII ones, are involved) > is to clearly say where which encoding is used. A very simple example > for this is GetAddInfoW, which assumes UTF-16. > > The only potential problem that I see from the discussion in > draft-iab-idn-encoding-00.txt is the following: Some labels containing > non-ASCII characters that fit into 63 octets in punycode and therefore > can be resolved with the DNS may not be resolvable with some other > resolution service because that service may use a different encoding > (and may or may not have different length limits). > > I have absolutely nothing against some text in a Security > Considerations > section or in Rationale pointing out that if you want to set up some > name or label for resolution via multiple different resolution > services, > you have to take care that you choose your names and labels so that > they > meet the length restrictions for all those services. But that doesn't > imply at all that we have to artificially restrict the length of > punycode labels by counting octets in UTF-8. Completely agree with all of the above. I think a brief discussion of this issue may make sense in the next version of draft-iab-idn-encoding, if we can get IAB consensus on text. > > (2) The "conversion of DNS name formats" issue that has been > > extensively discussed as part of the question of alternate label > > separators (sometimes described in our discussions as > > "dot-oids"). Applications that use domain names, including > > domain names that are not going to be resolved (or even looked > > up), must be able to freely and accurately converted between > > DNS-external (dot-separated labels) and DNS-internal > > (length-string pairs) formats _without_ knowing whether they are > > IDNs or not. > > I'm not exactly sure what you mean here. 
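As a concrete illustration of that advice, here is a minimal sketch (in
Python; the helper name and the assumption that the non-DNS service
imposes a 63-octet limit on the UTF-8 form are purely illustrative, not
taken from any spec) of checking a label against the limits of every
resolution service it is meant to work with:

    def fits_all_services(ulabel: str) -> bool:
        """Check an assumed pair of limits: the DNS limit of 63 octets on
        the A-label, plus an assumed 63-octet limit on the UTF-8 form for
        a hypothetical UTF-8-based resolution service."""
        try:
            # IDNA2003 ToASCII; Python's built-in codec already rejects
            # labels whose A-label form would be longer than 63 octets.
            alabel = ulabel.encode("idna")
        except UnicodeError:
            return False
        return len(alabel) <= 63 and len(ulabel.encode("utf-8")) <= 63

    print(fits_all_services("bücher"))   # True: 13-octet A-label, 7 octets of UTF-8

Nothing here argues for putting such a check into IDNA itself; it is the
kind of deployment-side advice a Security Considerations note could give.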
> > (2) The "conversion of DNS name formats" issue that has been
> > extensively discussed as part of the question of alternate label
> > separators (sometimes described in our discussions as
> > "dot-oids"). Applications that use domain names, including
> > domain names that are not going to be resolved (or even looked
> > up), must be able to freely and accurately convert between
> > DNS-external (dot-separated labels) and DNS-internal
> > (length-string pairs) formats _without_ knowing whether they are
> > IDNs or not.
>
> I'm not exactly sure what you mean here. If you want to say "without
> checking whether they contain xn-- prefixes and punycode or not", then I
> can agree, but that cannot motivate a UTF-8 based length restriction.

Right. I'm not sure why most "applications" would care about the
DNS-internal (length-string pairs) format rather than the NULL-terminated
strings (containing dot-separated labels) that get passed to
getaddrinfo-like functions. Most applications are (and should be)
oblivious to the fact that DNS or some other protocol is used for
resolving names.

> If you say that applications, rather than first converting U-label ->
> A-label and then converting from dot-separated to length-string
> notation, have to be able to first convert to length-string notation and
> then convert U-labels to A-labels, then I contend that nobody in their
> right mind would do it that way, and even less if "dot-oids" are
> involved. For starters, U-labels don't have a fixed encoding.
>
> > As discussed earlier, one of several reasons for
> > that requirement is that, in non-IDNA-aware contexts, labels in
> > non-IDNA-aware applications or contexts may be perfectly valid
> > as far as the DNS is concerned, because the only restriction the
> > DNS (and the normal label type) imposes is "octets".
>
> If and where somebody has binary labels, of course these binary labels
> must not be longer than 63 octets. But IDNA doesn't use binary labels,
> and doesn't stuff UTF-8 into DNS protocol slots, so for IDNA, any length
> restrictions on UTF-8 are irrelevant.
>
> > That
> > length-string format has a hard limit of 63 characters that can
> > be exceeded only if one can figure out how to get a larger
> > number into six bits (see RFC1035, first paragraph of Section
> > 3.1, and elsewhere).
>
> I very well know that the 63 octets (not characters) limit is a hard
> one. In the long run, one might imagine an extension to DNS that uses
> another label format, without this limitation, but there is no need at
> all to go there for this discussion.
>
> > If we permit longer U-label strings on the
> > theory that the only important restriction is on A-labels, we
> > introduce new error states into the format conversion process.
>
> For IDNA, only A-labels get sent through the DNS protocol, so only there
> is the length restriction for labels relevant. If somebody gets this
> wrong in the format conversion process (we currently don't have any
> reports on that), then that's their problem (and we can point it out in
> a Security section or so).
>
> > If this needs more explanation somewhere (possibly in
> > Rationale), I'm happy to try to do that. But I think
> > eliminating the restriction would cause far more problems than
> > it is worth.
>
> It hasn't caused ANY problems in IDNA2003. There is nothing new in
> IDNA2008 that would motivate a change. *Running code*, one of the
> guidelines of the IETF, shows that the restriction is unnecessary.
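For what it's worth, one quick experiment with a piece of running code
(Python's built-in IDNA2003 codec; the repeated-character label is a
contrived example, not a real name) shows a U-label whose UTF-8 form is
well over 63 octets converting without complaint to an A-label that fits
the DNS limit:

    ulabel = "日" * 30                      # contrived 30-character U-label

    print(len(ulabel.encode("utf-8")))      # 90 octets of UTF-8 -- over 63
    alabel = ulabel.encode("idna")          # IDNA2003 ToASCII succeeds anyway
    print(alabel[:4])                       # b'xn--'
    print(len(alabel))                      # well under 63, thanks to punycode compression

Whether such a label is a good idea is a separate question; the point is
only that this deployed IDNA2003 implementation imposes no UTF-8-based
length limit on the U-label, only the 63-octet limit on the resulting
A-label.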
> > I note that, while I haven't had time to respond, some of the
> > discussion on the IRI list has included an argument that domain
> > names in URIs cannot be restricted to A-label forms but must
> > include %-escaped UTF-8 simply because those strings might not
> > be public-DNS domain names but references to some other database
> > or DNS environment.
>
> It's not 'simply because'. It's first and foremost because of the
> syntactic uniformity of URIs, and the fact that it's impossible to
> identify all domain names in an URI (the usual slot after the '//' is
> easy, scheme-specific processing (which is not what URIs and IRIs are
> about) may be able to deal with some of 'mailto', but what do you do
> about domain names in query parts?). Also, this syntax is part of RFC
> 3986, STD 66, a full IETF Standard.
>
> Overall, it's just a question of what escaping convention should be
> used. URIs have their specific escaping convention (%-encoding), and DNS
> has its specific escaping convention (punycode).
>
> Also please note that the IRI spec doesn't prohibit using punycode when
> converting to URIs.
>
> In addition, please note that at least my personal implementation
> experience (adding IDN support to Amaya) shows that the overhead of
> supporting %-encoding in domain names in URIs is minimal, and helps
> streamline the implementation.
>
> > It seems to me that one cannot have it
> > both ways -- either the application knows whether a string is a
> > public DNS reference that must conform _only_ to IDNA
> > requirements (but then can be restricted to A-labels) or the
> > application does not know and therefore must conform to DNS
> > requirements for label lengths.
>
> There is absolutely no need to restrict *all* references just because
> *some of them* may use other resolver systems with other length
> restrictions (which may be "63 octets per label when measured in UTF-8"
> or something completely different). It would be very similar to saying
> "Some compilers/linkers can only deal with identifiers 6 characters or
> shorter, so all longer identifiers are prohibited."

I agree with that.

> > For our purposes, the only
> > sensible way, at least IMO, to deal with this is to require
> > conformance to both sets of rules, i.e., 63 character maximum
> > for A-labels and 63 character maximum for U-labels.
>
> As far as I understand punycode, it's impossible to encode a Unicode
> character in less than one octet. This means that a maximum of 63
> *characters* for U-labels is automatically guaranteed by a maximum of 63
> characters/octets for A-labels.
>
> However, Defs clearly says "length in octets of the UTF-8 form", so I
> guess this was just a slip of your fingers.
>
> Regards, Martin.

-Dave
Received on Tuesday, 15 September 2009 12:09:16 UTC