issue: handling non-ascii hostnames in URIs (and IRIs/Hrefs) from Roy T. Fielding on 2009-08-08 (public-iri@w3.org from August 2009)

From: Roy T. Fielding <fielding@gbiv.com>
Date: Fri, 7 Aug 2009 17:04:09 -0700
To: public-iri@w3.org
Cc: "John Klensin (klensin@jck.com)" <klensin@jck.com>
Message-Id: <E9597EB2-8855-486D-BB0A-39583E900D30@gbiv.com>

During the Stockholm IETF's BOF on IRIbis, John Klensin described a
point of contention between the various hostname-related specifications
for IDNAbis and the way that both 3986 and 3987 allow Unicode characters
(in raw or pct-encoded form, or both) within the reg-name syntax and
within other parts of the URI syntax (depending on scheme) that may
hold a domain name.  It is a difficult issue because it combines
multiple standards, multiple name resolution APIs (only one of which
is expected to comply with IDNA), and a huge number of deployed
client implementations.

Unfortunately, John was in a rush to attend another meeting and it
was a bit late in the day for critical thinking, so I agreed to try
to describe the issue (as I understood it) in writing to the group.
Hopefully, John can add clarifications where I misunderstood.

All of the following applies equally to Hrefs, IRIs, and URIs
(via pct-encoded octets).  The problem is as follows:

1) URIs often contain host names, and not just in authority.

These are not always DNS names, as indicated in RFC 3986:

    This specification does not mandate a particular registered name
    lookup technology and therefore does not restrict the syntax of reg-
    name beyond what is necessary for interoperability.  Instead, it
    delegates the issue of registered name syntax conformance to the
    operating system of each application performing URI resolution, and
    that operating system decides what it will allow for the purpose of
    host identification.  A URI resolution implementation might use DNS,
    host tables, yellow pages, NetInfo, WINS, or any other system for
    lookup of registered names.  However, a globally scoped naming
    system, such as DNS fully qualified domain names, is necessary for
    URIs intended to have global scope.  URI producers should use names
    that conform to the DNS syntax, even when use of DNS is not
    immediately apparent, and should limit these names to no more than
    255 characters in length.

    The reg-name syntax allows percent-encoded octets in order to
    represent non-ASCII registered names in a uniform way that is
    independent of the underlying name resolution technology.  Non-ASCII
    characters must first be encoded according to UTF-8 [STD63], and then
    each octet of the corresponding UTF-8 sequence must be percent-
    encoded to be represented as URI characters.  URI producing
    applications must not use percent-encoding in host unless it is used
    to represent a UTF-8 character sequence.  When a non-ASCII registered
    name represents an internationalized domain name intended for
    resolution via the DNS, the name must be transformed to the IDNA
    encoding [RFC3490] prior to name lookup.  URI producers should
    provide these registered names in the IDNA encoding, rather than a
    percent-encoding, if they wish to maximize interoperability with
    legacy URI resolvers.

Note that 3986 does not indicate a specific algorithm for looking
up such names in the name resolver.  It doesn't even require that
the pct-encoded octets be supplied to the name resolver in the
*appropriate* encoding that is expected and supported by the resolver API,
aside from the second-to-last sentence about IDNA, though conversion
to the API character encoding was certainly what I was thinking.
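
As an illustration only (it is not part of RFC 3986 and is not offered
as a solution), the conversion described in the quoted text could be
sketched as follows for the case where the name is known to be headed
for the DNS.  The helper name reg_name_to_dns_name is hypothetical, and
the sketch assumes Python's built-in "idna" codec, which implements the
RFC 3490 (IDNA2003) rules rather than IDNAbis.

```python
from urllib.parse import unquote


def reg_name_to_dns_name(reg_name: str) -> bytes:
    """Hypothetical helper: turn a URI reg-name into an ASCII DNS name."""
    # Decode pct-encoded octets inside the user agent.  Strict mode
    # rejects octet sequences that are not valid UTF-8, which RFC 3986
    # says producers must not place in host.
    decoded = unquote(reg_name, encoding="utf-8", errors="strict")
    # Python's built-in "idna" codec applies RFC 3490 ToASCII per label.
    return decoded.encode("idna")


print(reg_name_to_dns_name("b%C3%BCcher.example"))
# prints b'xn--bcher-kva.example'
```

A resolver that is not DNS-based would need a different final step,
such as encoding the decoded name in whatever character encoding that
resolver's API expects.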

2) DNS is a flexible, generic name lookup system.

In principle, DNS can be used to look up any kind of name.  There
is hardly any constraint on the syntax of names for lookup -- the
constraints are on what is allowed to be *registered*.  There is
nothing to say that a pct-encoded triplet could not appear in DNS
as a local name, though everyone seems to agree that is unlikely
except for deliberate attempts to alias a non-encoded name.
It is important, therefore, that user agents deal with the
pct-encoded octets internally and *not* expect the resolver to
translate them.

3) Aliases and confusables are dangerous.

Phishing is a security problem, especially when a host name can
be constructed that looks like a well-known host but consists of
slightly different characters.  Preventing confusable domains,
where possible, is one of the concerns of IDNAbis.  In order to
reduce the hazard of confusables, user agents need a consistent
algorithm for performing a name lookup that takes into account
all of the potential Unicode encodings (punycode, pct-encode, raw,
etc.).  Likewise, agents need to be able to normalize such names
to a common format for the purpose of name comparison, such as
within spam or virus checking software that filters on host.
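
A minimal sketch of such a normalization, for comparison purposes only,
might reduce the raw, pct-encoded, and punycode forms of a name to a
single lowercase A-label form.  The function name comparable_host is
hypothetical, and the sketch again assumes Python's built-in "idna"
codec (RFC 3490 rules).  It folds encoding variants together; it does
nothing about visually confusable characters.

```python
from urllib.parse import unquote


def comparable_host(host: str) -> str:
    """Hypothetical helper: map equivalent host spellings to one form."""
    # Undo pct-encoding first, so "b%C3%BCcher" and "bücher" are the same.
    decoded = unquote(host, encoding="utf-8", errors="strict")
    # Convert non-ASCII labels to A-labels; ASCII labels pass through.
    ascii_form = decoded.encode("idna").decode("ascii")
    # ASCII labels keep their original case, so fold case explicitly.
    return ascii_form.lower()


forms = [
    "bücher.example",           # raw Unicode
    "b%C3%BCcher.example",      # pct-encoded UTF-8
    "xn--bcher-kva.example",    # punycode A-label
    "BÜCHER.Example",           # mixed case
]
print({comparable_host(f) for f in forms})
# prints {'xn--bcher-kva.example'}
```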

4) IDNA exclusions are not enough.

Although one might say that's a job for IDNAbis, the fact
is that IDNA only extends to Internet registered names, not local
or network-specific names, and user agents are not likely to
implement host name restrictions that only work with IDNA-aware
resolvers.  Therefore, user agents need to be informed and possibly
required to use a safe (or at least consistent) algorithm for the
name lookup, even though (strictly speaking) there is no specific
Internet standard for how to interface with a network name resolver
when the supplied name is from a possibly untrusted source.

5) Non-ASCII characters in host names are not all that new.

IDNA and the increasing use of non-ASCII names may be a
relatively new thing to the Internet, but many operating
system name resolvers have allowed non-ASCII names for
much longer.  We therefore cannot assume that all non-ASCII
names must be transformed via IDNA to an A-label.  In any
case, most user agents do not do so (but I haven't tested
that lately).

=======

So, that's the issue.  I'll resist the temptation to try to
solve it in this message.  Is this something that IRI should
try to solve/suggest potential remedies, should a solution be
hammered out by the URI folks and posted as an erratum for 3986,
or should a more general working group (dealing with all forms
of name resolution lookup and not just those from URI refs)
handle the problem?  Did I forget any relevant bits above?

Likewise, I'd like to know exactly what the current browsers
and non-HTML user agents do when they perform a name lookup
on various operating systems.


Cheers,

Roy T. Fielding                            <http://roy.gbiv.com/>
Chief Scientist, Day Software              <http://www.day.com/>
Received on Saturday, 8 August 2009 00:27:52 UTC