- From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Date: Tue, 25 Aug 2009 19:01:32 +0900
- To: "Roy T. Fielding" <fielding@gbiv.com>, "John Klensin (klensin@jck.com)" <klensin@jck.com>
- CC: public-iri@w3.org
Hello Roy,

Sorry for not replying earlier, I was on vacation.

Many thanks for this very detailed summary, which is very helpful.
Two questions for the moment; hopefully I will be able to comment in
more detail later.

First, why are there two (at least seemingly) identical emails
(http://lists.w3.org/Archives/Public/public-iri/2009Aug/0010.html and
http://lists.w3.org/Archives/Public/public-iri/2009Aug/0007.html)?
Is this a simple 'clerical error', or something else?

Second, while you give a lot of detailed background in your mail,
I fail to see THE issue. I feel like I don't see the forest for
all the trees. A simple summary (something like "1 and 3 and 4
together contradict 2 and 5 because of foo") might be enough to
help me out.

I would also be very glad to hear from John about whether he has
anything else to add.

Regards,    Martin.

On 2009/08/10 22:45, Roy T. Fielding wrote:
> During the Stockholm IETF's BOF on IRIbis, John Klensin described a
> point of contention between the various hostname-related specifications
> for IDNAbis and the way that both 3986 and 3987 allow Unicode characters
> (in raw or pct-encoded form, or both) within the reg-name syntax and
> within other parts of the URI syntax (depending on scheme) that may
> hold a domain name. It is a difficult issue because it combines
> multiple standards, multiple name resolution APIs (only one of which
> is expected to comply with IDNA), and a huge number of deployed
> client implementations.
>
> Unfortunately, John was in a rush to attend another meeting and it
> was a bit late in the day for critical thinking, so I agreed to try
> to describe the issue (as I understood it) in writing to the group.
> Hopefully, John can add clarifications where I misunderstood.
>
> All of the following applies equally to Hrefs, IRIs, and URIs
> (via pct-encoded octets). The problem is as follows:
>
> 1) URIs often contain host names, and not just in authority.
>
>    These are not always DNS names, as indicated in RFC 3986:
>
>       This specification does not mandate a particular registered name
>       lookup technology and therefore does not restrict the syntax of
>       reg-name beyond what is necessary for interoperability. Instead, it
>       delegates the issue of registered name syntax conformance to the
>       operating system of each application performing URI resolution, and
>       that operating system decides what it will allow for the purpose of
>       host identification. A URI resolution implementation might use DNS,
>       host tables, yellow pages, NetInfo, WINS, or any other system for
>       lookup of registered names. However, a globally scoped naming
>       system, such as DNS fully qualified domain names, is necessary for
>       URIs intended to have global scope. URI producers should use names
>       that conform to the DNS syntax, even when use of DNS is not
>       immediately apparent, and should limit these names to no more than
>       255 characters in length.
>
>       The reg-name syntax allows percent-encoded octets in order to
>       represent non-ASCII registered names in a uniform way that is
>       independent of the underlying name resolution technology. Non-ASCII
>       characters must first be encoded according to UTF-8 [STD63], and then
>       each octet of the corresponding UTF-8 sequence must be
>       percent-encoded to be represented as URI characters. URI producing
>       applications must not use percent-encoding in host unless it is used
>       to represent a UTF-8 character sequence. When a non-ASCII registered
>       name represents an internationalized domain name intended for
>       resolution via the DNS, the name must be transformed to the IDNA
>       encoding [RFC3490] prior to name lookup. URI producers should
>       provide these registered names in the IDNA encoding, rather than a
>       percent-encoding, if they wish to maximize interoperability with
>       legacy URI resolvers.
>
>    Note that 3986 does not indicate a specific algorithm for looking
>    up such names in the name resolver.
>    It doesn't even require that
>    the pct-encoded octets be supplied to the name resolver in the
>    *appropriate* encoding that is expected+supported by the resolver API,
>    aside from the second-to-last sentence about IDNA, though conversion
>    to the API character encoding was certainly what I was thinking.
>
> 2) DNS is a flexible, generic name lookup system.
>
>    In principle, DNS can be used to lookup any kind of name. There
>    is hardly any constraint on the syntax of names for lookup -- the
>    constraints are on what is allowed to be *registered*. There is
>    nothing to say that a pct-encoded triplet could not appear in DNS
>    as a local name, though everyone seems to agree that is unlikely
>    except for deliberate attempts to alias a non-encoded name.
>    It is important, therefore, that user agents deal with the
>    pct-encoded octets internally and *not* expect the resolver to
>    translate them.
>
> 3) Aliases and confusables are dangerous.
>
>    Phishing is a security problem, especially when a host name can
>    be constructed that looks like a well-known host but consists of
>    slightly different characters. Preventing confusable domains,
>    where possible, is one of the concerns of IDNAbis. In order to
>    reduce the hazard of confusables, user agents need a consistent
>    algorithm for performing a name lookup that takes into account
>    all of the potential Unicode encodings (punycode, pct-encode, raw,
>    etc.). Likewise, agents need to be able to normalize such names
>    to a common format for the purpose of name comparison, such as
>    within spam or virus checking software that filters on host.
>
> 4) IDNA exclusions are not enough.
>
>    Although one might say that's a job for IDNAbis, the fact
>    is that IDNA only extends to Internet registered names, not local
>    or network-specific names, and user agents are not likely to
>    implement host name restrictions that only work with IDNA-aware
>    resolvers.
>    Therefore, user agents need to be informed and possibly
>    required to use a safe (or at least consistent) algorithm for the
>    name lookup, even though (strictly speaking) there is no specific
>    Internet standard for how to interface with a network name resolver
>    when the supplied name is from a possibly untrusted source.
>
> 5) non-ASCII characters in host names is not all that new.
>
>    IDNA and the increasing use of non-ASCII names may be a
>    relatively new thing to the Internet, but many operating
>    system name resolvers have allowed non-ASCII names for
>    much longer. We therefore cannot assume that all non-ASCII
>    names must be transformed via IDNA to an A-label. In any
>    case, most user agents do not do so (but I haven't tested
>    that lately).
>
> =======
>
> So, that's the issue. I'll resist the temptation to try to
> solve it in this message. Is this something that IRI should
> try to solve/suggest potential remedies, should a solution be
> hammered out by the URI folks and posted as an errata for 3986,
> or should a more general working group (dealing with all forms
> of name resolution lookup and not just those from URI refs)
> handle the problem? Did I forget any relevant bits above?
>
> Likewise, I'd like to know exactly what the current browsers
> and non-HTML user agents do when they perform a name lookup
> on various operating systems.
>
>
> Cheers,
>
> Roy T. Fielding                     <http://roy.gbiv.com/>
> Chief Scientist, Day Software       <http://www.day.com/>
>

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp
Received on Tuesday, 25 August 2009 10:02:35 UTC
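
Point 3 of the quoted message asks for a way to normalize host names
that arrive in raw, pct-encoded, or punycode form into one common
format for comparison. The following is only a minimal sketch of that
idea, not an algorithm proposed in the thread. It assumes the name is
intended for DNS resolution (points 4 and 5 note that this is not
always the case), and the function name normalize_host is hypothetical.
It relies on Python's built-in "idna" codec, which implements the
RFC 3490 encoding cited in the RFC 3986 text, not IDNAbis.

```python
from urllib.parse import unquote_to_bytes


def normalize_host(host: str) -> str:
    """Map raw, pct-encoded, or A-label host forms to one ASCII form.

    Hypothetical helper for illustration; assumes a DNS-bound name.
    """
    # Undo percent-encoding. Raw non-ASCII characters in the input are
    # converted to their UTF-8 octets by unquote_to_bytes.
    octets = unquote_to_bytes(host)
    # RFC 3986 requires pct-encoded octets in host to be UTF-8; a
    # UnicodeDecodeError here signals a name that breaks that rule.
    text = octets.decode("utf-8")
    # Apply the RFC 3490 transformation per label. Labels that are
    # already ASCII (including existing A-labels) pass through.
    ascii_name = text.encode("idna").decode("ascii")
    # ASCII labels are not case-folded by the codec, so lowercase them
    # to make string comparison meaningful.
    return ascii_name.lower()


if __name__ == "__main__":
    for form in ("bücher.example",
                 "b%C3%BCcher.example",
                 "XN--BCHER-KVA.example"):
        print(form, "->", normalize_host(form))
```

With this sketch a filter could compare the normalized strings rather
than the forms as written. It does nothing about visually confusable
characters, which is a separate part of the problem described above.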