Re: issue: handling non-ascii hostnames in URIs (and IRIs/Hrefs) from Martin J. Dürst on 2009-08-25 (public-iri@w3.org from August 2009)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Tue, 25 Aug 2009 19:16:39 +0900
To: Erik van der Poel <erikv@google.com>
CC: "Roy T. Fielding" <fielding@gbiv.com>, public-iri@w3.org, "John Klensin (klensin@jck.com)" <klensin@jck.com>
Message-ID: <4A93BA07.7010505@it.aoyama.ac.jp>

Hello Erik,

On 2009/08/11 5:49, Erik van der Poel wrote:
> Hi Roy,
>
> Comments on specific items are inline, below:
>
> On Mon, Aug 10, 2009 at 6:45 AM, Roy T. Fielding <fielding@gbiv.com> wrote:
>> 1) URIs often contain host names, and not just in authority.
>
> Yes, but implementations are not expected to recognize a host name as
> such in the path component or query component of an HTTP URI/IRI/HREF.
> If a host name appears in such components, implementations will simply
> handle it the same way that it handles any text in such components
> (i.e. not-%-escaped non-ASCII text in the path component becomes
> %-escaped UTF-8, while not-%-escaped non-ASCII text in the query
> component gets converted to Unicode and back, in the case of HREFs).

I agree that this is the only way to do things. Applications on servers 
that want to deal with IDNs (I hope there will be a lot of these) will 
have to deal with this kind of data.
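
As a rough illustration of the handling Erik describes, here is a minimal
Python sketch. The function names are hypothetical, and the assumption that
the query is re-encoded in the page's own encoding is only one reading of
"converted to Unicode and back"; actual browsers may differ.

```python
from urllib.parse import quote


def escape_path(path: str) -> str:
    """Percent-escape non-ASCII path text as UTF-8.

    "/" and already-present "%" triplets are left alone.
    """
    return quote(path, safe="/%", encoding="utf-8")


def escape_query(query: str, page_encoding: str) -> str:
    """Percent-escape non-ASCII query text in the page encoding.

    Assumption: the query goes back to the encoding of the HTML page,
    which is why a server may see bytes that are not UTF-8.
    """
    return quote(query, safe="=&%", encoding=page_encoding, errors="replace")


if __name__ == "__main__":
    # A host name inside the path is treated like any other path text.
    print(escape_path("/redirect/\u5341.com"))
    print(escape_query("site=d\u00fcrst.example", "iso-8859-1"))
```

A server application that wants to recover an IDN from such data therefore
has to know which component it came from and, for the query, which encoding
the page used.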


> The browsers are inconsistent in their processing of %-escapes in the
> host name in HTTP HREFs. For example, Safari 4 converts <a
> href="http://&#x5341;%2ecom"> to xn--.com-9b5j, which means that they
> are unescaping the %2e *after* Punycoding.

Very wrong indeed. Can somebody talk to them?
Please note that %2e (period/dot) is actually a bad example. The dot is 
the label separator, which makes it a very special case.
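
The difference between the two orders of operation can be sketched in
Python. This is an illustration only: to_a_label is a hypothetical helper
that applies bare Punycode without nameprep, which is enough for this one
character but is not a full ToASCII implementation.

```python
from urllib.parse import unquote


def to_a_label(label: str) -> str:
    """Punycode one label if it contains non-ASCII (no nameprep applied)."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


def unescape_then_punycode(host: str) -> str:
    """Resolve %-escapes first, then split into labels and encode."""
    return ".".join(to_a_label(l) for l in unquote(host).split("."))


def punycode_then_unescape(host: str) -> str:
    """The order Erik attributes to Safari 4: encode first, unescape after."""
    return unquote(".".join(to_a_label(l) for l in host.split(".")))


if __name__ == "__main__":
    host = "\u5341%2ecom"  # the host from Erik's example
    print(unescape_then_punycode(host))
    print(punycode_then_unescape(host))  # reproduces xn--.com-9b5j
```

With the second order, the escaped dot is still part of the label when it is
Punycoded, so "%2ecom" ends up inside a single A-label and only later turns
into ".com".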

> Also, the browsers do very different things with %-escaped UTF-8. IE 8
> puts the %-escape triplets directly into DNS packets,

Very wrong indeed. Can somebody talk to them?

> Firefox 3.5 refuses to do the DNS lookup,

Sad and easily fixable, I guess. Can somebody talk to them?

> Opera 9 and Chrome 2 convert to the corresponding Punycode,

Oh, so the browsers doing the correct thing are in the majority. Great, 
I hope the others will be joining soon.
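
A minimal sketch of that behavior (unescape the %-escaped UTF-8, then
convert to Punycode) might look as follows in Python. host_for_dns is a
hypothetical name, and Python's built-in "idna" codec implements IDNA2003,
so this shows the general shape rather than what any particular browser
does.

```python
from urllib.parse import unquote


def host_for_dns(host: str) -> str:
    """Turn a possibly %-escaped host into the ASCII form sent to DNS."""
    # Interpret the escaped bytes as UTF-8; invalid sequences raise an error
    # instead of being passed on.
    unescaped = unquote(host, encoding="utf-8", errors="strict")
    # The stdlib "idna" codec (IDNA2003) splits on dots and encodes each label.
    return unescaped.encode("idna").decode("ascii")


if __name__ == "__main__":
    print(host_for_dns("%E5%8D%81.com"))
    print(host_for_dns("www.example.org"))
```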

> and Safari 4 unescapes, runs it through the
> iso-8859-1 to utf-8 converter and puts the 8-bit text into the DNS
> packet.

So this means raw doubly-encoded UTF-8? Very very wrong indeed. Can 
somebody talk to them?
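
To make that concrete, here is a small Python sketch of what Erik's
description would imply, assuming the unescaped bytes are treated as
iso-8859-1 and then re-encoded as UTF-8. The function name is hypothetical
and the steps are inferred from the description, not from Safari's code.

```python
from urllib.parse import unquote_to_bytes


def double_encode(host: str) -> bytes:
    """Unescape, misread the bytes as iso-8859-1, re-encode as UTF-8."""
    raw = unquote_to_bytes(host)  # the UTF-8 bytes the author intended
    return raw.decode("iso-8859-1").encode("utf-8")


if __name__ == "__main__":
    host = "%E5%8D%81.com"
    print(unquote_to_bytes(host))  # three bytes for U+5341, then ".com"
    print(double_encode(host))     # six bytes before ".com"
```

Each original UTF-8 byte above 0x7F becomes two bytes, so the name in the
DNS packet no longer matches either the UTF-8 or the Punycode form.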

>> 3) Aliases and confusables are dangerous.
>>
>> Phishing is a security problem, especially when a host name can
>> be constructed that looks like a well-known host but consists of
>> slightly different characters.  Preventing confusable domains,
>> where possible, is one of the concerns of IDNAbis.
>
> IDNAbis does not try to solve the entire security problem. It
> disallows most punctuation and symbol characters, but it does not
> disallow the Cyrillic 'a', which looks like the Latin 'a'. The working
> group knows this, and expects others (registries, clients) to take
> steps to protect users.

Very good point.
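
As an example of the kind of step a client or registry could take, here is
a crude Python sketch that flags labels mixing scripts. It is not part of
IDNAbis; the helper names are hypothetical, and deriving the script from the
first word of the Unicode character name is a shortcut, not a real script
property lookup.

```python
import unicodedata


def scripts_in_label(label: str) -> set:
    """Collect a rough script name for every letter in the label."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def looks_mixed(label: str) -> bool:
    """True if letters from more than one script appear in the label."""
    return len(scripts_in_label(label)) > 1


if __name__ == "__main__":
    print(looks_mixed("example"))        # all Latin
    print(looks_mixed("ex\u0430mple"))   # Cyrillic 'a' in a Latin label
```

A check like this catches only mixed labels; a label written entirely in
look-alike Cyrillic letters would pass, which is why registry policy is
still needed.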

>>   In order to
>> reduce the hazard of confusables, user agents need a consistent
>> algorithm for performing a name lookup that takes into account
>> all of the potential Unicode encodings (punycode, pct-encode, raw,
>> etc.).  Likewise, agents need to be able to normalize such names
>> to a common format for the purpose of name comparison, such as
>> within spam or virus checking software that filters on host.
>
> IDNAbis complicates this issue by allowing characters that used to be
> "mapped away" in IDNA2003. For example, IDNA2003 maps Eszett (U+00DF)
> to ss, while IDNAbis allows U+00DF in U-labels.

This is indeed a complication, but only a minor one, and one which 
should mostly be absorbed by 'bundling' or something like that from the 
affected registries.
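
The Eszett case can be shown with a short Python sketch. Python's built-in
"idna" codec implements IDNA2003, so it maps U+00DF to "ss". The second
helper is hypothetical and only shows what the A-label looks like when
U+00DF is kept, by applying bare Punycode with no mapping step.

```python
def idna2003_key(host: str) -> str:
    """Comparison key using IDNA2003 mapping (stdlib "idna" codec)."""
    return host.encode("idna").decode("ascii")


def a_label_keeping_eszett(label: str) -> str:
    """A-label if U+00DF is allowed in the U-label (no mapping applied)."""
    return "xn--" + label.encode("punycode").decode("ascii")


if __name__ == "__main__":
    # Under IDNA2003 both spellings normalize to the same name.
    print(idna2003_key("stra\u00dfe.example"))
    print(idna2003_key("strasse.example"))
    # Without the mapping, the Eszett spelling is a distinct name.
    print(a_label_keeping_eszett("stra\u00dfe"))
```

Software that compares host names (a spam filter, say) then has to decide
which of the two keys it uses, and a registry that bundles the two names
makes the difference largely invisible to users.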

>> 5) non-ASCII characters in host names are not all that new.
>>
>> IDNA and the increasing use of non-ASCII names may be a
>> relatively new thing to the Internet, but many operating
>> system name resolvers have allowed non-ASCII names for
>> much longer.  We therefore cannot assume that all non-ASCII
>> names must be transformed via IDNA to an A-label.  In any
>> case, most user agents do not do so (but I haven't tested
>> that lately).
>
> IE 6 unescapes %-escaped non-UTF-8 and puts that into the HTTP Host
> header, while IE 7 and 8 put the %-escape triplets directly into the
> Host header.

What do other browsers do for the host header? My understanding was that 
the right thing (the thing that actually works) is to put punycode in 
there. Should this be an issue for HTTPbis?
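
For illustration, a Python sketch of building the Host header with Punycode
could look like this. host_header is a hypothetical helper, the "idna" codec
is IDNA2003, and whether this is what HTTP should require is exactly the
open question.

```python
from urllib.parse import urlsplit


def host_header(iri: str) -> str:
    """Build a Host header line with the host converted to A-labels."""
    parts = urlsplit(iri)
    if parts.hostname is None:
        raise ValueError("no host in " + iri)
    host = parts.hostname.encode("idna").decode("ascii")
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return f"Host: {host}"


if __name__ == "__main__":
    print(host_header("http://\u5341.com/index.html"))
    print(host_header("http://\u5341.com:8080/index.html"))
```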

Regards,    Martin.

> Also, IE 7 and 8 have an option to turn off IDNA in the intranet
> (presumably sending UTF-8 instead).
>
>> So, that's the issue.  I'll resist the temptation to try to
>> solve it in this message.  Is this something that IRI should
>> try to solve/suggest potential remedies, should a solution be
>> hammered out by the URI folks and posted as an erratum for 3986,
>> or should a more general working group (dealing with all forms
>> of name resolution lookup and not just those from URI refs)
>> handle the problem?  Did I forget any relevant bits above?
>
> This thread seems to be about non-ASCII host names, but I might have
> comments about plain ASCII host names too. E.g. should http://tk/ be
> written as http://tk./ to try to prevent lookups such as
> tk.intranet.company.com where suffixes are appended?
>
>> Likewise, I'd like to know exactly what the current browsers
>> and non-HTML user agents do when they perform a name lookup
>> on various operating systems.
>
> At Google, we have developed HTML test files that cause a browser to
> emit DNS and HTTP packets, and a tool that generates reports from
> WireShark (Ethereal) packet sniff files. Diffs are highlighted, so we
> can see differences between browsers, browser versions, and operating
> systems.
>
> Erik
>
>

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp
Received on Tuesday, 25 August 2009 10:17:41 UTC