- From: Erik van der Poel <erikv@google.com>
- Date: Tue, 25 Aug 2009 08:37:58 -0700
- To: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Cc: "Roy T. Fielding" <fielding@gbiv.com>, public-iri@w3.org, "John Klensin (klensin@jck.com)" <klensin@jck.com>
Hello Martin,

On Tue, Aug 25, 2009 at 3:16 AM, "Martin J. Dürst" <duerst@it.aoyama.ac.jp> wrote:
> On 2009/08/11 5:49, Erik van der Poel wrote:
>> The browsers are inconsistent in their processing of %-escapes in the
>> host name in HTTP HREFs. For example, Safari 4 converts <a
>> href="http://十%2ecom"> to xn--.com-9b5j, which means that they
>> are unescaping the %2e *after* Punycoding.
>
> Very wrong indeed. Can somebody talk to them?

I filed a bug a while ago:

https://bugs.webkit.org/show_bug.cgi?id=16559

> Please note that %2e (period/dot) is actually a bad example. The dot is the
> label separator, which makes it a very special case.

Yes, it is a very special case, and it shows how badly things go when
you perform the steps in the wrong order.

>> Also, the browsers do very different things with %-escaped UTF-8. IE 8
>> puts the %-escape triplets directly into DNS packets,
>
> Very wrong indeed. Can somebody talk to them?

I will start with Shawn Steele.

>> Firefox 3.5 refuses to do the DNS lookup,
>
> Sad and easily fixable, I guess. Can somebody talk to them?

I filed a bug a while ago:

https://bugzilla.mozilla.org/show_bug.cgi?id=412457

As you can see in the above bug report, it is not so trivial to decide
what to do. (Detecting UTF-8, control characters, etc.)

>> and Safari 4 unescapes, runs it through the
>> iso-8859-1 to utf-8 converter and puts the 8-bit text into the DNS
>> packet.
>
> So this means raw doubly-escaped UTF-8?

Doubly-converted UTF-8.

>>> In order to
>>> reduce the hazard of confusables, user agents need a consistent
>>> algorithm for performing a name lookup that takes into account
>>> all of the potential Unicode encodings (punycode, pct-encode, raw,
>>> etc.). Likewise, agents need to be able to normalize such names
>>> to a common format for the purpose of name comparison, such as
>>> within spam or virus checking software that filters on host.
>>
>> IDNAbis complicates this issue by allowing characters that used to be
>> "mapped away" in IDNA2003. For example, IDNA2003 maps Eszett (U+00DF)
>> to ss, while IDNAbis allows U+00DF in U-labels.
>
> This is indeed a complication, but only a minor one, and one which should
> mostly be absorbed by 'bundling' or something like that from the affected
> registries.

If the major client implementers decide not to perform multiple
lookup, then it is a minor issue (for client implementers).

>>> 5) non-ASCII characters in host names is not all that new.
>>>
>>> IDNA and the increasing use of non-ASCII names may be a
>>> relatively new thing to the Internet, but many operating
>>> system name resolvers have allowed non-ASCII names for
>>> much longer. We therefore cannot assume that all non-ASCII
>>> names must be transformed via IDNA to an A-label. In any
>>> case, most user agents do not do so (but I haven't tested
>>> that lately).
>>
>> IE 6 unescapes %-escaped non-UTF-8 and puts that into the HTTP Host
>> header, while IE 7 and 8 put the %-escape triplets directly into the
>> Host header.
>
> What do other browsers do for the host header?

%-escaped non-UTF-8 is a very special case (and very rare). The other
browsers don't even perform the DNS lookup in that case, so they don't
make any TCP connection for the HTTP request either.

> My understanding was that the
> right thing (the thing that actually works) is to put punycode in there.

Yes, when the host name is encoded "normally", the browsers put
Punycode in there.

> Should this be an issue for HTTPbis?

I think HTTPbis should say something, to gently push the implementers
in one direction, whichever direction that might be.

Erik
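
Illustrative note on the ordering issue discussed above: the following is a minimal Python sketch, not any browser's actual code. It assumes the order "unescape %-escapes, decode as UTF-8, then convert to an A-label", and it uses Python's built-in "idna" codec, which implements IDNA2003 (so Eszett is mapped to ss). The function names host_to_ascii and safari4_style_bytes are hypothetical. Whether an escaped dot should become a label separator at all is a separate question that this sketch does not settle.

```python
from urllib.parse import unquote_to_bytes


def host_to_ascii(raw_host: str) -> bytes:
    """Hypothetical helper: unescape first, then apply IDNA2003 ToASCII."""
    # Step 1: undo %-escapes at the byte level, so that %-escaped UTF-8
    # sequences are reassembled before any decoding happens.
    unescaped = unquote_to_bytes(raw_host)
    # Step 2: interpret the bytes as UTF-8. Non-UTF-8 input raises
    # UnicodeDecodeError here instead of being passed on silently.
    text = unescaped.decode("utf-8")
    # Step 3: the built-in "idna" codec splits on dots and converts each
    # non-ASCII label to its Punycode ("xn--") form.
    return text.encode("idna")


def safari4_style_bytes(raw_host: str) -> bytes:
    """Mimics the double conversion described above: unescape, treat the
    resulting bytes as ISO-8859-1, and re-encode them as UTF-8."""
    return unquote_to_bytes(raw_host).decode("iso-8859-1").encode("utf-8")


if __name__ == "__main__":
    print(host_to_ascii("十%2ecom"))              # the escaped dot is unescaped before Punycoding
    print(host_to_ascii("stra%C3%9Fe.de"))        # IDNA2003 maps U+00DF to "ss"
    print(safari4_style_bytes("%E5%8D%81.com"))   # doubly-converted UTF-8
```

With this ordering the escaped dot has already become an ordinary label separator by the time Punycode is applied, so it cannot end up inside an "xn--" label. The second helper shows why the Safari 4 result is "doubly-converted": the three UTF-8 bytes of U+5341 become six bytes.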
Received on Tuesday, 25 August 2009 15:38:43 UTC