Re: Definitions limit on label length in UTF-8 from Martin J. Dürst on 2009-09-12 (public-iri@w3.org from September 2009)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Sat, 12 Sep 2009 12:14:12 +0900
To: John C Klensin <klensin@jck.com>
CC: Erik van der Poel <erikv@google.com>, Andrew Sullivan <ajs@shinkuro.com>, idna-update@alvestrand.no, dthaler@microsoft.com, "public-iri@w3.org" <public-iri@w3.org>
Message-ID: <4AAB1204.6090705@it.aoyama.ac.jp>
Hello John,

[Dave, this is Cc'ed to you because of some discussion relating to 
draft-iab-idn-encoding-00.txt.]

[I'm also cc'ing public-iri@w3.org because of the IRI-related issue at 
the end.]

[Everybody, please remove the Cc fields when they are unnecessary.]


Overall, I'm afraid that on this issue, more convoluted explanations 
will convince neither me nor anybody else, but I'll nevertheless try to 
answer your discussion below point-by-point.

What I (and I guess others on this list) really would like to know is 
whether you have any CONCRETE reports or evidence regarding problems 
with IDN labels that are longer than 63 octets when expressed in UTF-8.

Otherwise, Michel has put it much better than me: "given the lack of 
issues with IDNA2003 on that specific topic there are no reasons to 
introduce an incompatible change".


On 2009/09/12 0:47, John C Klensin wrote:
>
> --On Friday, September 11, 2009 17:37 +0900 "\"Martin J.
> Dürst\""<duerst@it.aoyama.ac.jp>  wrote:
>
>>> (John claimed that the email context required such a
>>> rule, but I did not bother to confirm that.)
>> Given dinosaur implementations such as sendmail, I can
>> understand the concern that some SMTP implementations may not
>> easily be upgradable to use domain names with more than 255
>> octets or labels with more than 63 octets. In that case, I
>> would have expected at least a security warning at
>> http://tools.ietf.org/html/rfc4952#section-9 (EAI is currently
>> written in terms of IDNA2003, and so there are no length
>> restrictions on U-labels).
>
> I obviously have not been explaining this very well.  The
> problem is not "dinosaur implementations"

Okay, good.

> but a combination of
> two things (which interact):
>
> (1) Late resolution of strings, possibly through APIs that
> resolve names in places that may not be the public DNS.
> Systems using those APIs may keep strings in UTF-8 until very
> late in the process, even passing the UTF-8 strings into the
> interface or converting them to ACE form just before calling the
> interface.  Either way, because other systems have come to rely
> on the 63 octet limit, strings longer than 63 characters pose a
> risk of unexpected problems.  The issues with this are better
> explained in draft-iab-idn-encoding-00.txt, which I would
> strongly encourage people in this WG to go read.

I have indeed read draft-iab-idn-encoding-00.txt (I sent comments to the 
author and the IAB and copied this list). That document mentions the 
length restrictions, as essentially the only restrictions in DNS itself, 
rather than in things on top of it. That document also (well, mainly) 
discusses the issue of names being handed down into APIs in various 
forms (UTF-8, UTF-16, punycode, legacy encodings,...), and being 
resolved by various mechanisms (DNS, NetBIOS, mDNS, hosts file,...), and 
the problem that these mechanisms may use and expect different encodings 
for non-ASCII characters.

However, I haven't found any mention, nor even a hint, in that document, 
of a need to restrict punycode labels to at most 63 octets when 
expressed in UTF-8.

The document mentions (as something that might happen, but shouldn't) 
that an application may pass a UTF-8 string to something like 
getaddrinfo, and that string may be passed directly to the DNS. First, 
if this happens, IDNA has already lost. Second, whether the string is 
UTF-8 or pure ASCII, if the API isn't prepared to handle labels longer 
than 63 octets and overall names longer than 255 octets defensively 
(i.e. return something like 'not found'), then the programmer should be 
fired. Anyway, in that case, the problem isn't with UTF-8.
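
As an illustration only, a minimal Python sketch of such a defensive check 
could look like the following. The names lookup and _resolve are 
hypothetical (they are not taken from any draft or real API), and 
_resolve is a stub standing in for whatever actually performs the lookup. 
The point is that the check counts octets and does not care whether they 
are ASCII or UTF-8.

```python
from typing import Optional

MAX_LABEL_OCTETS = 63   # RFC 1035: a label is at most 63 octets
MAX_NAME_OCTETS = 255   # RFC 1035: a whole name is at most 255 octets


def _resolve(name: bytes) -> Optional[str]:
    # Hypothetical placeholder for the real lookup; it just returns a
    # documentation address (192.0.2.0/24 is reserved for examples).
    return "192.0.2.1"


def lookup(name: bytes) -> Optional[str]:
    """Return an address, or None ('not found') if the name cannot fit
    into a DNS query, whether the octets are ASCII or UTF-8."""
    if name.endswith(b"."):
        name = name[:-1]  # drop the trailing root dot
    labels = name.split(b".")
    # Reject empty labels and labels longer than 63 octets.
    if any(not 1 <= len(label) <= MAX_LABEL_OCTETS for label in labels):
        return None
    # Wire form: one length octet per label, plus the root's zero octet.
    if sum(1 + len(label) for label in labels) + 1 > MAX_NAME_OCTETS:
        return None
    return _resolve(name)
```

With this sketch, a UTF-8 label of 90 octets is answered with 'not found' 
in exactly the same way as an ASCII label of 64 octets.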

What draft-iab-idn-encoding-00.txt essentially points out is that 
different name resolution services use different encodings for non-ASCII 
characters, and that currently different users (meaning applications) of 
a name resolution API may assume different encodings for non-ASCII 
characters, which creates all kinds of chances for errors. Some 
heuristics may help in some cases, but the right solution (as with all 
cases where characters, and in particular non-ASCII ones, are involved) 
is to state clearly which encoding is used where. A very simple example 
of this is GetAddrInfoW, which assumes UTF-16.

The only potential problem that I see from the discussion in 
draft-iab-idn-encoding-00.txt is the following: Some labels containing 
non-ASCII characters that fit into 63 octets in punycode and therefore 
can be resolved with the DNS may not be resolvable with some other 
resolution service because that service may use a different encoding 
(and may or may not have different length limits).

I have absolutely nothing against some text in a Security Considerations 
section or in Rationale pointing out that if you want to set up some 
name or label for resolution via multiple different resolution services, 
you have to take care that you choose your names and labels so that they 
meet the length restrictions for all those services. But that doesn't 
imply at all that we have to artificially restrict the length of 
punycode labels by counting octets in UTF-8.
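
To make the difference between the two ways of counting concrete, here 
is a small Python sketch. It assumes a deliberately contrived label (one 
hiragana letter repeated 30 times) and uses Python's built-in "punycode" 
codec directly, so none of the IDNA mapping or validation steps are 
applied.

```python
# A contrived label: HIRAGANA LETTER A repeated 30 times.
label = "\u3042" * 30

# Length of the U-label when expressed in UTF-8 (3 octets per character).
utf8_octets = len(label.encode("utf-8"))

# The corresponding ACE form: "xn--" prefix plus the punycode encoding.
a_label = b"xn--" + label.encode("punycode")

print("characters:     ", len(label))
print("UTF-8 octets:   ", utf8_octets)
print("A-label octets: ", len(a_label))
```

The UTF-8 form takes 90 octets, while the A-label stays within the 
63-octet DNS limit. Such a label is resolvable via the DNS, but would be 
prohibited by a rule that counts octets in UTF-8.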


> (2) The "conversion of DNS name formats" issue that has been
> extensively discussed as part of the question of alternate label
> separators (sometimes described in our discussions as
> "dot-oids").  Applications that use domain names, including
> domain names that are not going to be resolved (or even looked
> up), must be able to freely and accurately convert between
> DNS-external (dot-separated labels) and DNS-internal
> (length-string pairs) formats _without_ knowing whether they are
> IDNs or not.

I'm not exactly sure what you mean here. If you want to say "without 
checking whether they contain xn-- prefixes and punycode or not", then I 
can agree, but that cannot motivate a UTF-8 based length restriction.

If you say that applications, rather than first converting U-label -> 
A-label and then converting from dot-separated to length-string 
notation, have to be able to first convert to length-string notation and 
then convert U-labels to A-labels, then I contend that nobody in their 
right mind would do it that way, and even less if "dot-oids" are 
involved. For starters, U-labels don't have a fixed encoding.
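
The first ordering (U-label -> A-label, then dot-separated to 
length-string notation) can be sketched roughly as follows. This is an 
illustration under stated assumptions, not a conforming IDNA 
implementation: to_wire is a hypothetical helper, only the ASCII full 
stop is treated as a separator, and Python's built-in "punycode" codec is 
applied without any mapping or validation of the label.

```python
def to_wire(name: str) -> bytes:
    """Convert a dot-separated name to RFC 1035 length-string form,
    turning each non-ASCII label into an A-label first."""
    if name.endswith("."):
        name = name[:-1]  # drop the trailing root dot
    wire = bytearray()
    for label in name.split("."):
        if label.isascii():
            octets = label.encode("ascii")
        else:
            # Step 1: U-label -> A-label (no IDNA validation here).
            octets = b"xn--" + label.encode("punycode")
        # Step 2: the length check applies to what goes on the wire.
        if not 1 <= len(octets) <= 63:
            raise ValueError("label does not fit into 63 octets")
        wire.append(len(octets))  # length octet; the value fits in 6 bits
        wire += octets
    wire.append(0)  # zero-length root label terminates the name
    if len(wire) > 255:
        raise ValueError("name does not fit into 255 octets")
    return bytes(wire)


print(to_wire("example.com"))
```

In this order, the only length that is ever checked is that of the 
A-label, and the encoding of the U-label never enters the picture.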

> As discussed earlier, one of several reasons for
> that requirement is that, in non-IDNA-aware contexts, labels in
> non-IDNA-aware applications or contexts may be perfectly valid
> as far as the DNS is concerned, because the only restriction the
> DNS (and the normal label type) imposes is "octets".

If and where somebody has binary labels, of course these binary labels 
must not be longer than 63 octets. But IDNA doesn't use binary labels, 
and doesn't stuff UTF-8 into DNS protocol slots, so for IDNA, any length 
restrictions on UTF-8 are irrelevant.

> That
> length-string format has a hard limit of 63 characters that can
> be exceeded only if one can figure out how to get a larger
> number into six bits (see RFC1035, first paragraph of Section
> 3.1, and elsewhere).

I very well know that the 63 octets (not characters) limit is a hard 
one. In the long run, one might imagine an extension to DNS that uses 
another label format, without this limitation, but there is no need at 
all to go there for this discussion.

> If we permit longer U-label strings on the
> theory that the only important restriction is on A-labels, we
> introduce new error states into the format conversion process.

For IDNA, only A-labels get sent through the DNS protocol, so only 
there is the length restriction for labels relevant. If somebody gets 
this wrong in the format conversion process (we currently don't have any 
reports on that), then that's their problem (and we can point it out in 
a Security section or so).

> If this needs more explanation somewhere (possibly in
> Rationale), I'm happy to try to do that.  But I think
> eliminating the restriction would cause far more problems than
> it is worth.

It hasn't caused ANY problems in IDNA2003. There is nothing new in 
IDNA2008 that would motivate a change. *Running code*, one of the 
guidelines of the IETF, shows that the restriction is unnecessary.


> I note that, while I haven't had time to respond, some of the
> discussion on the IRI list has included an argument that domain
> names in URIs cannot be restricted to A-label forms but must
> include %-escaped UTF-8 simply because those strings might not
> be public-DNS domain names but references to some other database
> or DNS environment.

It's not 'simply because'. It's first and foremost because of the 
syntactic uniformity of URIs, and the fact that it's impossible to 
identify all domain names in a URI. The usual slot after the '//' is 
easy, and scheme-specific processing (which is not what URIs and IRIs 
are about) may be able to deal with some of 'mailto', but what do you do 
about domain names in query parts? Also, this syntax is part of RFC 
3986, STD 66, a full IETF Standard.

Overall, it's just a question of what escaping convention should be 
used. URIs have their specific escaping convention (%-encoding), and DNS
has its specific escaping convention (punycode).
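
The two conventions can be put side by side in a short Python sketch. 
The host name below is an arbitrary example chosen for illustration, and 
the ACE form is again produced with Python's built-in "punycode" codec, 
without any IDNA mapping or validation.

```python
from urllib.parse import quote, unquote

host = "例え.jp"  # arbitrary example host with one non-ASCII label

# URI convention: %-encode the octets of the UTF-8 form.
percent_form = quote(host)

# DNS convention: punycode with the "xn--" prefix, label by label.
ace_form = ".".join(
    label if label.isascii()
    else "xn--" + label.encode("punycode").decode("ascii")
    for label in host.split(".")
)

print(percent_form)
print(ace_form)

# Both escapings are reversible and lead back to the same name.
assert unquote(percent_form) == host
```

Both forms are pure ASCII and both can be converted back to the same 
sequence of characters. They differ only in which escaping is applied.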

Also please note that the IRI spec doesn't prohibit the use of punycode 
when converting to URIs.

In addition, please note that at least my personal implementation 
experience (adding IDN support to Amaya) shows that the overhead of 
supporting %-encoding in domain names in URIs is minimal, and helps 
streamline the implementation.

> It seems to me that one cannot have it
> both ways -- either the application knows whether a string is a
> public DNS reference that must conform _only_ to IDNA
> requirements (but then can be restricted to A-labels) or the
> application does not know and therefore must conform to DNS
> requirements for label lengths.

There is absolutely no need to restrict *all* references just because 
*some of them* may use other resolver systems with other length 
restrictions (which may be "63 octets per label when measured in UTF-8" 
or something completely different). It would be very similar to saying 
"Some compilers/linkers can only deal with identifiers 6 characters or 
shorter, so all longer identifiers are prohibited."

> For our purposes, the only
> sensible way, at least IMO, to deal with this is to require
> conformance to both sets of rules, i.e., 63 character maximum
> for A-labels and 63 character maximum for U-labels.

As far as I understand punycode, it's impossible to encode a Unicode 
character in less than one octet. This means that a maximum of 63 
*characters* for U-labels is automatically guaranteed by a maximum of 63 
characters/octets for A-labels.

However, Defs clearly says "length in octets of the UTF-8 form", so I 
guess this was just a slip of your fingers.

Regards,    Martin.

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp
Received on Saturday, 12 September 2009 03:15:32 UTC