Re: uri handling of hosts is too restrictive

Roy T. Fielding <fielding@gbiv.com> wrote:

 > Applications that use DNS for the sake of host name resolution must
 > obey those restrictions -- I have tried to clarify that in the
 > specification.

Thanks, I generally like the clarification.

 > A host identified by a registered name is a string of characters
 > that is intended for lookup within a locally-defined host or service
 > name registry.  The most common of such registry mechanisms is the
 > Domain Name System (DNS), as defined by Section 3 of [RFC1034] and
 > Section 2.1 of [RFC1123].  A DNS name consists of a sequence of domain
 > labels...

This reads as though section 3 of [RFC1034] and section 2.1 of
[RFC1123] define the DNS registry mechanism, but they merely define
the name syntax.  I think the intention was:

     The most common of such registry mechanisms is the Domain Name
     System (DNS).  A host name intended for lookup in the DNS uses
     the syntax defined in section 3.5 of [RFC1034] and section 2.1 of
     [RFC1123].  Such a name consists of a sequence of domain labels...
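
To make the syntax concrete, here is a minimal sketch of a check for
such names.  The helper name is_dns_host_name is hypothetical, and the
limits (labels of 1 to 63 letters, digits, and hyphens, beginning and
ending with a letter or digit; 255 characters overall) are my reading
of the two cited sections, not text from the draft:

```python
import re

# One domain label: letters, digits, and hyphens, beginning and ending
# with a letter or digit, 1 to 63 characters long (assumed reading of
# RFC 1034 section 3.5 as relaxed by RFC 1123 section 2.1).
_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?\Z")


def is_dns_host_name(name: str) -> bool:
    """Hypothetical helper: True if name fits the DNS host name syntax."""
    if name.endswith("."):
        name = name[:-1]  # a trailing dot marks a fully qualified name
    if not name or len(name) > 255:
        return False
    # Every dot-separated label must match; an empty label fails.
    return all(_LABEL.match(label) for label in name.split("."))


print(is_dns_host_name("www.example.com"))      # True
print(is_dns_host_name("b%C3%BCcher.example"))  # False
```

Note that a percent-encoded name fails this check, since "%" is not
part of the host name syntax.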

 > When a non-ASCII host name represents an internationalized domain
 > name intended for resolution via DNS, the name must be transformed
 > to the IDNA encoding [RFC3490] prior to name lookup.  URI producers
 > should provide such host names in the IDNA encoding, rather than a
 > percent-encoding, if they wish to maximize interoperability with
 > legacy URI resolvers.

I think that understates the situation.  What do you think of this:

     When a non-ASCII reg-name represents an internationalized domain
     name (IDN), the rules of IDNA apply [RFC3490].  IDNA requires that
     the name be transformed to its ASCII-compatible encoding (ACE)
     sometime prior to being looked up in the DNS.  Furthermore, IDNA
     requires that any producer of an IDN as a protocol element use
     the ACE form unless it knows that the consumer understands IDNA.
     Therefore, in the absence of such knowledge, any URI producer that
     wishes to use a non-ASCII domain name in the host component of a URI
     is required by IDNA to use the ACE form, not a percent-encoded UTF-8
     form.  This requirement is needed in order to interoperate with
     legacy URI resolvers, which do not know how to convert to the ACE
     form prior to DNS lookup.

I know that creates a challenge for the IRI spec, but that is really
what IDNA implies.
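
To illustrate the difference between the two forms, here is a small
sketch.  It assumes Python's built-in "idna" codec (which implements
the RFC 3490 conversion) and urllib.parse.quote; the helper names
ace_form and percent_encoded_utf8 are hypothetical and not part of any
spec:

```python
from urllib.parse import quote


def ace_form(host: str) -> str:
    """Hypothetical helper: the ASCII-compatible encoding of an IDN."""
    # The built-in "idna" codec converts each label as RFC 3490 describes.
    return host.encode("idna").decode("ascii")


def percent_encoded_utf8(host: str) -> str:
    """Hypothetical helper: the percent-encoded UTF-8 form of a host."""
    return quote(host, safe="")


host = "b\u00fccher.example"  # "bucher" with u-umlaut as the second letter

print(ace_form(host))              # xn--bcher-kva.example
print(percent_encoded_utf8(host))  # b%C3%BCcher.example
```

A legacy resolver can look up the first form as-is, because it is an
ordinary ASCII host name.  The second form requires the resolver to
percent-decode and then apply IDNA before the DNS lookup.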

The percent-encoding issue gives rise to compatibility considerations
analogous to the ones faced by IDNA, which could be addressed in an
analogous way:

     A previous version of the URI specification [RFC2396] did
     not permit percent-encoding within domain names in the host
     component, and there exist legacy URI resolvers that do not perform
     percent-decoding on domain names in the host component.  Therefore,
     a URI producer MUST NOT produce a URI with a host component
     containing a percent-encoded domain name (not even a percent-encoded
     ASCII domain name) unless the URI is being put into a context that
     explicitly invites such a URI.  This restriction applies only to
     domain names, not general reg-names.

A simpler approach that would subsume both paragraphs would be simply
to retain the [RFC2396] prohibition of percent-encoding within domain
names in the host component:

     Although percent-encoding is generally allowed in a reg-name, it is
     not allowed in a reg-name that is a domain name, for compatibility
     with the previous version of the URI specification [RFC2396].
     Internationalized domain names can be supported using IDNA
     [RFC3490].
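
On the producer side, the strict rule could be sketched roughly as
follows.  The function name host_for_uri is hypothetical, the use of
Python's built-in "idna" codec is my assumption, and this covers only
reg-names that are domain names:

```python
def host_for_uri(domain_name: str) -> str:
    """Hypothetical producer-side helper for the strict formulation.

    Returns the form of a domain name to place in the host component.
    """
    if "%" in domain_name:
        # Strict rule: no percent-encoding in a reg-name that is a
        # domain name, not even for ASCII characters.
        raise ValueError("percent-encoding not allowed in a domain name")
    # Non-ASCII labels are converted to their ACE form per RFC 3490;
    # plain ASCII names pass through unchanged.
    return domain_name.encode("idna").decode("ascii")


print(host_for_uri("www.example.com"))       # www.example.com
print(host_for_uri("b\u00fccher.example"))   # xn--bcher-kva.example
```

Passing "b%C3%BCcher.example" raises ValueError, so the producer never
emits a form that a legacy resolver would mishandle.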

I don't think this stricter formulation is any more challenging for the
IRI spec, and it sure is a lot simpler than those two long paragraphs.

With either the long formulation or the short strict formulation, the
IRI spec would face the same challenge.  I see that it's a formidable
one, but I'm not yet ready to give up hope of overcoming it.

AMC

Received on Monday, 16 February 2004 09:20:33 UTC