- From: Roy T. Fielding <fielding@gbiv.com>
- Date: Mon, 10 Aug 2009 09:45:47 -0400
- To: public-iri@w3.org
- Cc: "John Klensin (klensin@jck.com)" <klensin@jck.com>
During the Stockholm IETF's BOF on IRIbis, John Klensin described a point of contention between the various hostname-related specifications for IDNAbis and the way that both 3986 and 3987 allow Unicode characters (in raw or pct-encoded form, or both) within the reg-name syntax and within other parts of the URI syntax (depending on scheme) that may hold a domain name. It is a difficult issue because it combines multiple standards, multiple name resolution APIs (only one of which is expected to comply with IDNA), and a huge number of deployed client implementations.

Unfortunately, John was in a rush to attend another meeting and it was a bit late in the day for critical thinking, so I agreed to try to describe the issue (as I understood it) in writing to the group. Hopefully, John can add clarifications where I misunderstood.

All of the following applies equally to Hrefs, IRIs, and URIs (via pct-encoded octets). The problem is as follows:

1) URIs often contain host names, and not just in authority.

These are not always DNS names, as indicated in RFC 3986:

   This specification does not mandate a particular registered name lookup technology and therefore does not restrict the syntax of reg-name beyond what is necessary for interoperability. Instead, it delegates the issue of registered name syntax conformance to the operating system of each application performing URI resolution, and that operating system decides what it will allow for the purpose of host identification. A URI resolution implementation might use DNS, host tables, yellow pages, NetInfo, WINS, or any other system for lookup of registered names. However, a globally scoped naming system, such as DNS fully qualified domain names, is necessary for URIs intended to have global scope. URI producers should use names that conform to the DNS syntax, even when use of DNS is not immediately apparent, and should limit these names to no more than 255 characters in length.
   The reg-name syntax allows percent-encoded octets in order to represent non-ASCII registered names in a uniform way that is independent of the underlying name resolution technology. Non-ASCII characters must first be encoded according to UTF-8 [STD63], and then each octet of the corresponding UTF-8 sequence must be percent-encoded to be represented as URI characters. URI producing applications must not use percent-encoding in host unless it is used to represent a UTF-8 character sequence. When a non-ASCII registered name represents an internationalized domain name intended for resolution via the DNS, the name must be transformed to the IDNA encoding [RFC3490] prior to name lookup. URI producers should provide these registered names in the IDNA encoding, rather than a percent-encoding, if they wish to maximize interoperability with legacy URI resolvers.

Note that 3986 does not indicate a specific algorithm for looking up such names in the name resolver. It doesn't even require that the pct-encoded octets be supplied to the name resolver in the *appropriate* encoding that is expected+supported by the resolver API, aside from the second-to-last sentence about IDNA, though conversion to the API character encoding was certainly what I was thinking.

2) DNS is a flexible, generic name lookup system.

In principle, DNS can be used to look up any kind of name. There is hardly any constraint on the syntax of names for lookup -- the constraints are on what is allowed to be *registered*. There is nothing to say that a pct-encoded triplet could not appear in DNS as a local name, though everyone seems to agree that is unlikely except for deliberate attempts to alias a non-encoded name. It is important, therefore, that user agents deal with the pct-encoded octets internally and *not* expect the resolver to translate them.

3) Aliases and confusables are dangerous.
Phishing is a security problem, especially when a host name can be constructed that looks like a well-known host but consists of slightly different characters. Preventing confusable domains, where possible, is one of the concerns of IDNAbis. In order to reduce the hazard of confusables, user agents need a consistent algorithm for performing a name lookup that takes into account all of the potential Unicode encodings (punycode, pct-encode, raw, etc.). Likewise, agents need to be able to normalize such names to a common format for the purpose of name comparison, such as within spam or virus checking software that filters on host.

4) IDNA exclusions are not enough.

Although one might say that's a job for IDNAbis, the fact is that IDNA only extends to Internet registered names, not local or network-specific names, and user agents are not likely to implement host name restrictions that only work with IDNA-aware resolvers. Therefore, user agents need to be informed and possibly required to use a safe (or at least consistent) algorithm for the name lookup, even though (strictly speaking) there is no specific Internet standard for how to interface with a network name resolver when the supplied name is from a possibly untrusted source.

5) Non-ASCII characters in host names are not all that new.

IDNA and the increasing use of non-ASCII names may be a relatively new thing to the Internet, but many operating system name resolvers have allowed non-ASCII names for much longer. We therefore cannot assume that all non-ASCII names must be transformed via IDNA to an A-label. In any case, most user agents do not do so (but I haven't tested that lately).

=======

So, that's the issue. I'll resist the temptation to try to solve it in this message.
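Purely to make the three spellings from item 3 concrete (raw, pct-encoded, punycode), and not as a proposed solution, here is a minimal Python sketch. The function name normalize_host is hypothetical. It assumes the name is destined for the DNS and leans on Python's built-in "idna" codec, which implements the RFC 3490 flavor of IDNA; per item 5, names headed for other resolvers should not necessarily be transformed this way.

```python
from urllib.parse import unquote


def normalize_host(host: str) -> str:
    """Reduce raw, pct-encoded, or punycode spellings of a host name to
    one ASCII form suitable for comparison.

    Assumption: the name is intended for DNS lookup, so converting it
    to an A-label is appropriate.
    """
    # Decode pct-encoded octets as UTF-8 inside the user agent instead
    # of handing "%C3%BC" to the resolver. Octets that are not valid
    # UTF-8 raise UnicodeDecodeError rather than being passed along.
    decoded = unquote(host, encoding="utf-8", errors="strict")

    # The built-in "idna" codec applies nameprep and converts each
    # non-ASCII label to its "xn--" form; labels that are already
    # ASCII pass through unchanged, so lowercase them afterwards.
    return decoded.encode("idna").decode("ascii").lower()


# Three spellings of the same name: raw, pct-encoded, and punycode.
spellings = [
    "bücher.example",
    "b%C3%BCcher.example",
    "XN--BCHER-KVA.EXAMPLE",
]

# All three collapse to a single comparison key.
keys = {normalize_host(s) for s in spellings}
print(keys)
```

A filter that compares hosts would compare the returned keys rather than the strings as they appeared in the reference. Visually confusable but distinct characters are a separate problem that this sketch does not address.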
Is this something that IRI should try to solve or suggest potential remedies for, should a solution be hammered out by the URI folks and posted as an erratum for 3986, or should a more general working group (dealing with all forms of name resolution lookup and not just those from URI refs) handle the problem?

Did I forget any relevant bits above? Likewise, I'd like to know exactly what the current browsers and non-HTML user agents do when they perform a name lookup on various operating systems.

Cheers,

Roy T. Fielding <http://roy.gbiv.com/>
Chief Scientist, Day Software <http://www.day.com/>
Received on Monday, 10 August 2009 13:46:09 UTC