Re: issue: handling non-ascii hostnames in URIs (and IRIs/Hrefs)

Hi Roy,

Comments on specific items are inline, below:

On Mon, Aug 10, 2009 at 6:45 AM, Roy T. Fielding<fielding@gbiv.com> wrote:
> 1) URIs often contain host names, and not just in authority.

Yes, but implementations are not expected to recognize a host name as
such in the path component or query component of an HTTP URI/IRI/HREF.
If a host name appears in such components, implementations will simply
handle it the same way that they handle any text in such components
(i.e. not-%-escaped non-ASCII text in the path component becomes
%-escaped UTF-8, while not-%-escaped non-ASCII text in the query
component gets converted to Unicode and back, in the case of HREFs).
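
To make the path rule concrete, here is a minimal sketch in Python using
the standard urllib.parse module. It is not what any particular browser
runs; the helper name escape_path and the sample path are made up for
illustration. The point is only that a host name inside the path gets the
same UTF-8 %-escaping as any other text.

```python
from urllib.parse import quote


def escape_path(path: str) -> str:
    """Percent-escape non-ASCII text in a path as UTF-8 octets.

    '%' is listed as safe so that existing %-escapes are left alone.
    A host name that happens to appear in the path is not recognized
    as such and gets no special treatment.
    """
    return quote(path, safe="/%", encoding="utf-8")


# Hypothetical path that contains a non-ASCII host name and an
# already-escaped segment.
print(escape_path("/redirect/b\u00fccher.example/caf%C3%A9"))
# /redirect/b%C3%BCcher.example/caf%C3%A9
```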

>   The reg-name syntax allows percent-encoded octets in order to
>   represent non-ASCII registered names in a uniform way that is
>   independent of the underlying name resolution technology.  Non-ASCII
>   characters must first be encoded according to UTF-8 [STD63], and then
>   each octet of the corresponding UTF-8 sequence must be percent-
>   encoded to be represented as URI characters.  URI producing
>   applications must not use percent-encoding in host unless it is used
>   to represent a UTF-8 character sequence.  When a non-ASCII registered
>   name represents an internationalized domain name intended for
>   resolution via the DNS, the name must be transformed to the IDNA
>   encoding [RFC3490] prior to name lookup.  URI producers should
>   provide these registered names in the IDNA encoding, rather than a
>   percent-encoding, if they wish to maximize interoperability with
>   legacy URI resolvers.
>
> Note that 3986 does not indicate a specific algorithm for looking
> up such names in the name resolver.  It doesn't even require that
> the pct-encoded octets be supplied to the name resolver in the
> *appropriate* encoding that is expected+supported by the resolver API,
> aside from the second-to-last sentence about IDNA, though conversion
> to the API character encoding was certainly what I was thinking.

The browsers are inconsistent in their processing of %-escapes in the
host name in HTTP HREFs. For example, Safari 4 converts <a
href="http://&#x5341;%2ecom"> to xn--.com-9b5j, which means that it is
unescaping the %2e *after* Punycoding.
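
The order of operations can be reproduced with Python's built-in
"punycode" and "idna" codecs. This is only a sketch of the two orderings,
not Safari's actual code path; the function names are my own, and the
"idna" codec implements IDNA2003 (RFC 3490).

```python
from urllib.parse import unquote


def punycode_then_unescape(label: str) -> str:
    """Punycode the raw text first, then undo the %-escapes."""
    ace = "xn--" + label.encode("punycode").decode("ascii")
    return unquote(ace)


def unescape_then_idna(host: str) -> str:
    """Undo the %-escapes first, then apply IDNA2003 ToASCII per label."""
    return unquote(host).encode("idna").decode("ascii")


host = "\u5341%2ecom"  # U+5341 followed by a %-escaped dot

# Matches the Safari 4 result quoted above.
print(punycode_then_unescape(host))  # xn--.com-9b5j

# With the other ordering, the dot acts as a label separator and only
# the first label is converted.
print(unescape_then_idna(host))
```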

Also, the browsers do very different things with %-escaped UTF-8. IE 8
puts the %-escape triplets directly into DNS packets, Firefox 3.5
refuses to do the DNS lookup, Opera 9 and Chrome 2 convert to the
corresponding Punycode, and Safari 4 unescapes, runs it through the
iso-8859-1 to utf-8 converter and puts the 8-bit text into the DNS
packet.
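
As a rough model of three of these behaviors (Firefox 3.5, which does no
lookup, is left out), the sketch below shows the bytes of the name each
strategy would hand to the resolver. The function names and the sample
host are assumptions for illustration; real DNS packets use
length-prefixed labels, which this does not model.

```python
from urllib.parse import unquote_to_bytes

# Example input: %-escaped UTF-8 for U+5341, followed by ".com".
ESCAPED_HOST = "%E5%8D%81.com"


def keep_triplets(host: str) -> bytes:
    """Pass the %-escape triplets through untouched (IE 8 as described)."""
    return host.encode("ascii")


def to_punycode(host: str) -> bytes:
    """Unescape as UTF-8, then IDNA2003 ToASCII (Opera 9, Chrome 2)."""
    return unquote_to_bytes(host).decode("utf-8").encode("idna")


def latin1_to_utf8(host: str) -> bytes:
    """Unescape, read the octets as ISO-8859-1, re-encode as UTF-8
    (Safari 4 as described)."""
    return unquote_to_bytes(host).decode("iso-8859-1").encode("utf-8")


for fn in (keep_triplets, to_punycode, latin1_to_utf8):
    print(f"{fn.__name__:15} {fn(ESCAPED_HOST)!r}")
```

Note how the last strategy double-encodes: the three UTF-8 octets become
six, so the name on the wire no longer corresponds to the original
character at all.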

> 3) Aliases and confusables are dangerous.
>
> Phishing is a security problem, especially when a host name can
> be constructed that looks like a well-known host but consists of
> slightly different characters.  Preventing confusable domains,
> where possible, is one of the concerns of IDNAbis.

IDNAbis does not try to solve the entire security problem. It
disallows most punctuation and symbol characters, but it does not
disallow the Cyrillic 'a', which looks like the Latin 'a'. The working
group knows this, and expects others (registries, clients) to take
steps to protect users.
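
To illustrate the kind of step a registry or client might take, here is
a crude mixed-script check in Python. It is a sketch only: guessing the
script from the first word of the Unicode character name is a heuristic
I am assuming for brevity, and the function name and sample label are
hypothetical.

```python
import unicodedata


def rough_scripts(label: str) -> set[str]:
    """Guess the scripts in a label from each letter's Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in label if ch.isalpha()}


latin = "bank"
mixed = "b\u0430nk"  # second letter is U+0430, not U+0061

print(latin == mixed)                # False
print(unicodedata.name(mixed[1]))    # CYRILLIC SMALL LETTER A
print(sorted(rough_scripts(mixed)))  # ['CYRILLIC', 'LATIN']

# IDNA2003 ToASCII accepts the mixed-script label without complaint.
print(mixed.encode("idna"))
```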

> In order to
> reduce the hazard of confusables, user agents need a consistent
> algorithm for performing a name lookup that takes into account
> all of the potential Unicode encodings (punycode, pct-encode, raw,
> etc.).  Likewise, agents need to be able to normalize such names
> to a common format for the purpose of name comparison, such as
> within spam or virus checking software that filters on host.

IDNAbis complicates this issue by allowing characters that used to be
"mapped away" in IDNA2003. For example, IDNA2003 maps Eszett (U+00DF)
to ss, while IDNAbis allows U+00DF in U-labels.
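
The difference shows up directly in the ASCII form of the name. In the
sketch below, Python's built-in "idna" codec stands in for IDNA2003, and
the raw "punycode" codec is used to approximate what happens when U+00DF
is kept rather than mapped. That second step is an assumption standing in
for a real IDNAbis implementation, which does more than skip the mapping.

```python
label = "stra\u00dfe"  # contains Eszett, U+00DF

# IDNA2003: nameprep maps U+00DF to "ss" before conversion, so the
# result is a plain ASCII label.
idna2003 = label.encode("idna")
print(idna2003)  # b'strasse'

# No mapping: U+00DF survives and is Punycoded into the label.
unmapped = b"xn--" + label.encode("punycode")
print(unmapped)  # b'xn--strae-oqa'

# The same typed name can therefore lead to two different lookups.
print(idna2003 == unmapped)  # False
```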

> 5) non-ASCII characters in host names is not all that new.
>
> IDNA and the increasing use of non-ASCII names may be a
> relatively new thing to the Internet, but many operating
> system name resolvers have allowed non-ASCII names for
> much longer.  We therefore cannot assume that all non-ASCII
> names must be transformed via IDNA to an A-label.  In any
> case, most user agents do not do so (but I haven't tested
> that lately).

IE 6 unescapes %-escaped non-UTF-8 and puts that into the HTTP Host
header, while IE 7 and 8 put the %-escape triplets directly into the
Host header.

Also, IE 7 and 8 have an option to turn off IDNA in the intranet
(presumably sending UTF-8 instead).

> So, that's the issue.  I'll resist the temptation to try to
> solve it in this message.  Is this something that IRI should
> try to solve/suggest potential remedies, should a solution be
> hammered out by the URI folks and posted as an errata for 3986,
> or should a more general working group (dealing with all forms
> of name resolution lookup and not just those from URI refs)
> handle the problem?  Did I forget any relevant bits above?

This thread seems to be about non-ASCII host names, but I might have
comments about plain ASCII host names too. E.g. should http://tk/ be
written as http://tk./ to try to prevent lookups such as
tk.intranet.company.com where suffixes are appended?

> Likewise, I'd like to know exactly what the current browsers
> and non-HTML user agents do when they perform a name lookup
> on various operating systems.

At Google, we have developed HTML test files that cause a browser to
emit DNS and HTTP packets, and a tool that generates reports from
Wireshark (formerly Ethereal) packet capture files. Diffs are highlighted, so we
can see differences between browsers, browser versions, and operating
systems.

Erik

Received on Monday, 10 August 2009 20:50:45 UTC