Re: issue: handling non-ascii hostnames in URIs (and IRIs/Hrefs) from Martin J. Dürst on 2009-08-25 (public-iri@w3.org from August 2009)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Tue, 25 Aug 2009 19:01:32 +0900
To: "Roy T. Fielding" <fielding@gbiv.com>, "John Klensin (klensin@jck.com)" <klensin@jck.com>
CC: public-iri@w3.org
Message-ID: <4A93B67C.9010102@it.aoyama.ac.jp>

Hello Roy,

Sorry for not replying earlier, I was on vacation.

Many thanks for this very detailed summary, which is very helpful.

Two questions for the moment; hopefully I will be able to comment in 
more detail later.

First, why are there two (at least seemingly) identical emails
(http://lists.w3.org/Archives/Public/public-iri/2009Aug/0010.html and
http://lists.w3.org/Archives/Public/public-iri/2009Aug/0007.html)? Is 
this a simple 'clerical error', or something else?

Second, while you give a lot of detailed background in your mail, I fail 
to see THE issue. I feel like I don't see the forest for all the trees. 
A simple summary (something like "1 and 3 and 4 together contradict 2 
and 5 because of foo") might be enough to help me out.

I would also be very glad to hear from John about whether he has 
anything else to add.

Regards,    Martin.

On 2009/08/10 22:45, Roy T. Fielding wrote:
> During the Stockholm IETF's BOF on IRIbis, John Klensin described a
> point of contention between the various hostname-related specifications
> for IDNAbis and the way that both 3986 and 3987 allow Unicode characters
> (in raw or pct-encoded form, or both) within the reg-name syntax and
> within other parts of the URI syntax (depending on scheme) that may
> hold a domain name. It is a difficult issue because it combines
> multiple standards, multiple name resolution APIs (only one of which
> is expected to comply with IDNA), and a huge number of deployed
> client implementations.
>
> Unfortunately, John was in a rush to attend another meeting and it
> was a bit late in the day for critical thinking, so I agreed to try
> to describe the issue (as I understood it) in writing to the group.
> Hopefully, John can add clarifications where I misunderstood.
>
> All of the following applies equally to Hrefs, IRIs, and URIs
> (via pct-encoded octets). The problem is as follows:
>
> 1) URIs often contain host names, and not just in authority.
>
> These are not always DNS names, as indicated in RFC 3986:
>
> This specification does not mandate a particular registered name
> lookup technology and therefore does not restrict the syntax of reg-
> name beyond what is necessary for interoperability. Instead, it
> delegates the issue of registered name syntax conformance to the
> operating system of each application performing URI resolution, and
> that operating system decides what it will allow for the purpose of
> host identification. A URI resolution implementation might use DNS,
> host tables, yellow pages, NetInfo, WINS, or any other system for
> lookup of registered names. However, a globally scoped naming
> system, such as DNS fully qualified domain names, is necessary for
> URIs intended to have global scope. URI producers should use names
> that conform to the DNS syntax, even when use of DNS is not
> immediately apparent, and should limit these names to no more than
> 255 characters in length.
>
> The reg-name syntax allows percent-encoded octets in order to
> represent non-ASCII registered names in a uniform way that is
> independent of the underlying name resolution technology. Non-ASCII
> characters must first be encoded according to UTF-8 [STD63], and then
> each octet of the corresponding UTF-8 sequence must be percent-
> encoded to be represented as URI characters. URI producing
> applications must not use percent-encoding in host unless it is used
> to represent a UTF-8 character sequence. When a non-ASCII registered
> name represents an internationalized domain name intended for
> resolution via the DNS, the name must be transformed to the IDNA
> encoding [RFC3490] prior to name lookup. URI producers should
> provide these registered names in the IDNA encoding, rather than a
> percent-encoding, if they wish to maximize interoperability with
> legacy URI resolvers.
>
> Note that 3986 does not indicate a specific algorithm for looking
> up such names in the name resolver. It doesn't even require that
> the pct-encoded octets be supplied to the name resolver in the
> *appropriate* encoding that is expected and supported by the resolver API,
> aside from the second-to-last sentence about IDNA, though conversion
> to the API character encoding was certainly what I was thinking.
>
> 2) DNS is a flexible, generic name lookup system.
>
> In principle, DNS can be used to look up any kind of name. There
> is hardly any constraint on the syntax of names for lookup -- the
> constraints are on what is allowed to be *registered*. There is
> nothing to say that a pct-encoded triplet could not appear in DNS
> as a local name, though everyone seems to agree that is unlikely
> except for deliberate attempts to alias a non-encoded name.
> It is important, therefore, that user agents deal with the
> pct-encoded octets internally and *not* expect the resolver to
> translate them.
>
> 3) Aliases and confusables are dangerous.
>
> Phishing is a security problem, especially when a host name can
> be constructed that looks like a well-known host but consists of
> slightly different characters. Preventing confusable domains,
> where possible, is one of the concerns of IDNAbis. In order to
> reduce the hazard of confusables, user agents need a consistent
> algorithm for performing a name lookup that takes into account
> all of the potential Unicode encodings (punycode, pct-encoded, raw,
> etc.). Likewise, agents need to be able to normalize such names
> to a common format for the purpose of name comparison, such as
> within spam or virus checking software that filters on host.
>
> 4) IDNA exclusions are not enough.
>
> Although one might say that's a job for IDNAbis, the fact
> is that IDNA only extends to Internet registered names, not local
> or network-specific names, and user agents are not likely to
> implement host name restrictions that only work with IDNA-aware
> resolvers. Therefore, user agents need to be informed and possibly
> required to use a safe (or at least consistent) algorithm for the
> name lookup, even though (strictly speaking) there is no specific
> Internet standard for how to interface with a network name resolver
> when the supplied name is from a possibly untrusted source.
>
> 5) Non-ASCII characters in host names are not all that new.
>
> IDNA and the increasing use of non-ASCII names may be a
> relatively new thing to the Internet, but many operating
> system name resolvers have allowed non-ASCII names for
> much longer. We therefore cannot assume that all non-ASCII
> names must be transformed via IDNA to an A-label. In any
> case, most user agents do not do so (but I haven't tested
> that lately).
>
> =======
>
> So, that's the issue. I'll resist the temptation to try to
> solve it in this message. Is this something that IRI should
> try to solve/suggest potential remedies, should a solution be
> hammered out by the URI folks and posted as an erratum for 3986,
> or should a more general working group (dealing with all forms
> of name resolution lookup and not just those from URI refs)
> handle the problem? Did I forget any relevant bits above?
>
> Likewise, I'd like to know exactly what the current browsers
> and non-HTML user agents do when they perform a name lookup
> on various operating systems.
>
>
> Cheers,
>
> Roy T. Fielding <http://roy.gbiv.com/>
> Chief Scientist, Day Software <http://www.day.com/>
>

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp
Received on Tuesday, 25 August 2009 10:02:35 UTC
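
Point 3 of the quoted message asks for a way to normalize host names to a common format so that the raw, pct-encoded, and punycode spellings of one name compare equal. The following is only a minimal sketch of that idea, not an algorithm proposed by anyone in this thread. It assumes the name is destined for DNS (point 5 notes that this cannot be assumed in general), and it relies on Python's built-in "idna" codec, which implements RFC 3490 (IDNA2003) rather than IDNAbis. The function name normalize_host is hypothetical.

```python
from urllib.parse import unquote


def normalize_host(host: str) -> str:
    """Map raw, pct-encoded, or punycode spellings of a host to one ASCII form.

    Assumption: the host is a DNS name, so IDNA applies. Local or
    network-specific names would need different handling.
    """
    # Decode pct-encoded octets inside the user agent, reading them as
    # UTF-8 as RFC 3986 requires; invalid UTF-8 raises UnicodeDecodeError.
    decoded = unquote(host, encoding="utf-8", errors="strict")
    # The "idna" codec runs nameprep and ToASCII on each label (IDNA2003).
    # Labels that are already ASCII, including A-labels, pass through as-is.
    ascii_form = decoded.encode("idna").decode("ascii")
    # ASCII labels are not case-folded by the codec, so fold them here.
    return ascii_form.lower()


if __name__ == "__main__":
    for spelling in ("bücher.example",
                     "b%C3%BCcher.example",
                     "XN--BCHER-KVA.EXAMPLE"):
        print(spelling, "->", normalize_host(spelling))
```

A filter that compares hosts (the spam or virus checker mentioned in point 3) could compare the return values instead of the original strings. The decoding is done before any lookup, in line with point 2's advice that user agents should not expect the resolver to translate pct-encoded octets.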