
Re: issue: handling non-ascii hostnames in URIs (and IRIs/Hrefs)

From: Erik van der Poel <erikv@google.com>
Date: Tue, 25 Aug 2009 08:37:58 -0700
Message-ID: <c07a32650908250837q577382ffn8edf2670495d69a@mail.gmail.com>
To: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Cc: "Roy T. Fielding" <fielding@gbiv.com>, public-iri@w3.org, "John Klensin (klensin@jck.com)" <klensin@jck.com>
Hello Martin,

On Tue, Aug 25, 2009 at 3:16 AM, "Martin J. Dürst" <duerst@it.aoyama.ac.jp> wrote:
> On 2009/08/11 5:49, Erik van der Poel wrote:
>> The browsers are inconsistent in their processing of %-escapes in the
>> host name in HTTP HREFs. For example, Safari 4 converts <a
>> href="http://&#x5341;%2ecom"> to xn--.com-9b5j, which means that they
>> are unescaping the %2e *after* Punycoding.
> Very wrong indeed. Can somebody talk to them?

I filed a bug a while ago:


> Please note that %2e (period/dot) is actually a bad example. The dot is the
> label separator, which makes it a very special case.

Yes, it is a very special case, and it shows how badly things go when
you perform the steps in the wrong order.
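To illustrate the ordering problem, here is a minimal Python sketch. It is only one way to reproduce the reported output and makes no claim about Safari's actual code. The function names wrong_order and right_order are made up for this example, and the built-in "idna" codec implements IDNA2003, not IDNAbis.

```python
from urllib.parse import unquote

# Host from the example href: U+5341 followed by the literal text "%2ecom".
raw_host = "\u5341%2ecom"

def wrong_order(host: str) -> str:
    # Punycode the whole string first. The %-escape is still present, so the
    # dot is not seen as a label separator. Unescape only afterwards.
    encoded = "xn--" + host.encode("punycode").decode("ascii")
    return unquote(encoded)

def right_order(host: str) -> str:
    # Unescape first, so %2e becomes a real label separator, then run
    # ToASCII on each label (the "idna" codec splits on dots itself).
    return unquote(host).encode("idna").decode("ascii")

wrong = wrong_order(raw_host)  # "xn--.com-9b5j", the value reported above
right = right_order(raw_host)  # an A-label followed by a separate "com" label
```

In the wrong order, ".com" ends up inside a single label, so the name no longer lives under the com TLD at all.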

>> Also, the browsers do very different things with %-escaped UTF-8. IE 8
>> puts the %-escape triplets directly into DNS packets,
> Very wrong indeed. Can somebody talk to them?

I will start with Shawn Steele.

>> Firefox 3.5 refuses to do the DNS lookup,
> Sad and easily fixable, I guess. Can somebody talk to them?

I filed a bug a while ago:


As you can see in the above bug report, it is not so trivial to decide
what to do. (Detecting UTF-8, control characters, etc.)
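As a rough sketch of the kind of checks involved (an assumed policy for illustration, not what any browser does and not a proposal), one could unescape the host, require strictly valid UTF-8, and reject control characters. The name unescape_host is hypothetical.

```python
import unicodedata
from typing import Optional
from urllib.parse import unquote_to_bytes

def unescape_host(host: str) -> Optional[str]:
    """Return the unescaped host, or None if it should be refused."""
    raw = unquote_to_bytes(host)
    try:
        # Strict decoding rejects truncated and overlong sequences.
        text = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None
    # Refuse any control character (Unicode general category Cc).
    if any(unicodedata.category(ch) == "Cc" for ch in text):
        return None
    return text
```

For example, unescape_host("%E5%8D%81.com") yields the host with U+5341 as its first label, while "%E5%8D.com" (truncated UTF-8) and "a%00b.com" (a control character) are both refused. A real implementation would still have to decide what to do with the refused cases.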

>> and Safari 4 unescapes, runs it through the
>> iso-8859-1 to utf-8 converter and puts the 8-bit text into the DNS
>> packet.
> So this means raw doubly-escaped UTF-8?

Doubly-converted UTF-8.
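To make "doubly-converted" concrete, here is a small Python sketch of the byte-level effect, using U+5341 as an assumed sample character. It only mimics the described steps (unescape, then treat the bytes as iso-8859-1 and convert to UTF-8).

```python
from urllib.parse import unquote_to_bytes

escaped_label = "%E5%8D%81"  # U+5341 as %-escaped UTF-8 (sample input)

# Step 1: unescape, giving the original three UTF-8 bytes.
utf8_bytes = unquote_to_bytes(escaped_label)

# Step 2: misread each byte as one iso-8859-1 character.
as_latin1 = utf8_bytes.decode("iso-8859-1")

# Step 3: convert those characters to UTF-8 again.
doubly_converted = as_latin1.encode("utf-8")
```

Each of the three original bytes is 0x80 or above, so each becomes a two-byte sequence, and six bytes go out instead of three.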

>>>  In order to
>>> reduce the hazard of confusables, user agents need a consistent
>>> algorithm for performing a name lookup that takes into account
>>> all of the potential Unicode encodings (punycode, pct-encode, raw,
>>> etc.).  Likewise, agents need to be able to normalize such names
>>> to a common format for the purpose of name comparison, such as
>>> within spam or virus checking software that filters on host.
>> IDNAbis complicates this issue by allowing characters that used to be
>> "mapped away" in IDNA2003. For example, IDNA2003 maps Eszett (U+00DF)
>> to ss, while IDNAbis allows U+00DF in U-labels.
> This is indeed a complication, but only a minor one, and one which should
> mostly be absorbed by 'bundling' or something like that from the affected
> registries.

If the major client implementers decide not to perform multiple
lookups, then it is a minor issue (for client implementers).
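The Eszett case can be sketched in Python as follows. The built-in "idna" codec implements IDNA2003; the second value is produced with the raw "punycode" codec merely to show what a label that keeps U+00DF looks like on the wire, and it skips all IDNAbis validity checks. The sample name is an assumption chosen for illustration.

```python
name = "stra\u00dfe.de"  # sample host containing Eszett (U+00DF)

# IDNA2003: Eszett is mapped to "ss" before encoding, leaving a plain ASCII name.
idna2003 = name.encode("idna").decode("ascii")

# Keeping U+00DF in the label instead (as IDNAbis permits in U-labels)
# produces a different name for the same user input.
label, tld = name.split(".")
keeps_eszett = "xn--" + label.encode("punycode").decode("ascii") + "." + tld
```

The two results are different DNS names, which is why a client might consider looking up both.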

>>> 5) non-ASCII characters in host names is not all that new.
>>> IDNA and the increasing use of non-ASCII names may be a
>>> relatively new thing to the Internet, but many operating
>>> system name resolvers have allowed non-ASCII names for
>>> much longer.  We therefore cannot assume that all non-ASCII
>>> names must be transformed via IDNA to an A-label.  In any
>>> case, most user agents do not do so (but I haven't tested
>>> that lately).
>> IE 6 unescapes %-escaped non-UTF-8 and puts that into the HTTP Host
>> header, while IE 7 and 8 put the %-escape triplets directly into the
>> Host header.
> What do other browsers do for the host header?

%-escaped non-UTF-8 is a very special case (and very rare). The other
browsers don't even perform the DNS lookup in that case, so they don't
make any TCP connection for the HTTP request either.

> My understanding was that the
> right thing (the thing that actually works) is to put punycode in there.

Yes, when the host name is encoded "normally", the browsers put
Punycode in there.

> Should this be an issue for HTTPbis?

I think HTTPbis should say something, to gently push the implementers
in one direction, whichever direction that might be.

Received on Tuesday, 25 August 2009 15:38:43 UTC
