Re: Are IDNs allowed in http IRIs?

"Roy T. Fielding" <fielding@gbiv.com> wrote:

> When 2396 is updated, all protocols that depend on 2396 (including
> HTTP) are automatically revised as a result -- that is the nature of a
> normative reference

Is that principle documented somewhere?

> As such, the http URI will be defined in terms of the new URI RFC as
> soon as that RFC is published, unless (or until) a revised 2616 is
> published that says differently.

Even if we accept that, and if the current draft of rfc2396bis is
published as an RFC, it still looks like percent-encoded non-ASCII
hostnames are not allowed in http: URIs.  The only thing rfc2396bis says
about IDNs is this:

    When a non-ASCII host name represents an internationalized domain
    name intended for resolution via DNS, the name must be transformed
    to the IDNA encoding [RFC3490] prior to name lookup.

That sentence merely restates a fact already stated in IDNA; it does
not tell us when a reg-name represents an IDN (as opposed to a
non-domain-name).  It therefore does not explicitly designate a protocol
element for carrying an IDN, which IDNA requires before a non-ASCII
domain name may be used in a protocol element.
Presumably an individual scheme spec could say that non-ASCII reg-names
in its host component do in fact represent IDNs, but of course the HTTP
spec does not say this for the http: scheme (because it predates IDNA).
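
For concreteness, here is a rough sketch of the transformation the quoted
sentence describes, for a reg-name that some scheme has declared to be an
IDN.  It assumes the percent-encoded octets are UTF-8 (my assumption, not
something the HTTP spec says), and it leans on Python's built-in "idna"
codec, which implements the RFC 3490 ToASCII operation.  The function name
is made up for illustration.

```python
from urllib.parse import unquote


def reg_name_to_dns(reg_name: str) -> str:
    """Turn a percent-encoded reg-name into the ASCII form used for DNS lookup.

    Assumption: the percent-encoded octets are UTF-8.
    """
    # Step 1: undo the percent-encoding, failing loudly on invalid UTF-8.
    decoded = unquote(reg_name, encoding="utf-8", errors="strict")
    # Step 2: apply IDNA ToASCII (RFC 3490) to each dot-separated label.
    return decoded.encode("idna").decode("ascii")


print(reg_name_to_dns("b%C3%BCcher.example"))  # xn--bcher-kva.example
print(reg_name_to_dns("www.example.com"))      # www.example.com
```

Nothing in this sketch decides *whether* the reg-name is an IDN; that is
exactly the question the draft leaves open.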

Perhaps it was your intention for rfc2396bis to make a stronger
statement, something like:

    For any scheme that uses the reg-name component to hold domain
    names, percent-encoded non-ASCII names represent internationalized
    domain names, and therefore they must be transformed to ASCII prior
    to lookup in DNS, as specified in IDNA [RFC3490].

That would suffice if we knew that names in the host component of
http: URIs were domain names, but after the publication of rfc2396bis
we won't know that anymore.  Under RFC-2396, we knew that the foo
in http://foo/ was a host name (which is a kind of domain name)
because hostname was the only kind of name in the grammar for the host
component.  But if the citations in the HTTP spec to RFC-2396 are
implicitly redirected to RFC-2396bis, then foo is now a reg-name, which
is not necessarily a domain name, and therefore the stronger statement
above doesn't apply.  The HTTP spec never bothered to say that its names
were domain names, because the citation to RFC-2396 implied it.

In order to get non-ASCII domain names into http: URIs without reissuing
the HTTP spec, I think rfc2396bis would not only have to use the
stronger statement above, but also distinguish between a host and a
reg-name, for example:

    authority   = [ userinfo "@" ] coordinator [ ":" port ]

    A scheme can use either of two kinds of coordinators.  For schemes
    that use hosts identified by standard internet identifiers (IP
    addresses and domain names),

    coordinator = host
    host        = IP-literal / IPv4address / hostname

    For schemes that use hosts identified by other means, or non-hosts
    (like abstract namespace registries),

    coordinator = reg-name

    Generic URI parsers that don't know which kind of scheme they're
    dealing with can use

    coordinator = *( unreserved / pct-encoded / sub-delims )
                / "[" *( unreserved / sub-delims / ":" ) "]"

    In any URI for which either of the more specific coordinator rules
    matches, the less specific rule will also match the same substring.
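
The generic coordinator rule above can be sketched as a regular
expression to check the claim that it accepts anything the more specific
rules accept.  The character sets for unreserved, sub-delims, and
pct-encoded below are the ones from RFC 3986 as eventually published;
treating them as identical to the draft's is an assumption on my part,
and the names in the code are purely illustrative.

```python
import re

# Character sets as in RFC 3986 (assumed to match the draft).
UNRESERVED = r"A-Za-z0-9\-._~"
SUB_DELIMS = r"!$&'()*+,;="
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"

# coordinator = *( unreserved / pct-encoded / sub-delims )
#             / "[" *( unreserved / sub-delims / ":" ) "]"
GENERIC_COORDINATOR = re.compile(
    rf"(?:[{UNRESERVED}{SUB_DELIMS}]|{PCT_ENCODED})*"
    rf"|\[[{UNRESERVED}{SUB_DELIMS}:]*\]"
)


def matches_generic(coordinator: str) -> bool:
    """True if the whole string matches the generic coordinator rule."""
    return GENERIC_COORDINATOR.fullmatch(coordinator) is not None


# One sample per specific form: hostname, IPv4address, IP-literal, reg-name.
for sample in ("www.example.com", "192.0.2.1", "[2001:db8::1]",
               "b%C3%BCcher.example"):
    print(sample, matches_generic(sample))
```

A scheme-aware parser would apply the narrower host or reg-name rule
instead; this pattern is only what a scheme-agnostic parser could fall
back on.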

This way, the HTTP spec's reference to the "host" token of RFC-2396,
which gets redirected to RFC-2396bis, would still imply that names are
domain names, and therefore the stronger statement about IDNs, if it
were included in RFC-2396bis, would apply to http: URIs, and non-ASCII
IDNs would be allowed (percent-encoded) in http: URIs.

Allowing non-ASCII host names in http: URIs would invite
interoperability problems with legacy browsers, but if you want it
anyway, the grammar change above is one way to get it.

An alternative approach is to not try to get non-ASCII host names into
existing schemes.  New schemes could use non-ASCII host names (if their
specs say so), but existing schemes could not use them until their
individual scheme specs are revised, and each scheme could decide
whether it wanted to do that and incur the interoperability penalty.  In
the meantime, the IRI spec would have to face the issue of URI schemes
in which non-ASCII host names are permitted by the generic URI spec but
not by the scheme spec.

> If all of the HTTP implementations send the host subcomponent verbatim
> within the Host header field, then that is how the revision to 2616
> will be defined as well.

And until RFC-2616 is revised, if RFC-2396bis automatically updates the
http: URI syntax to allow percent-encoded non-ASCII host names, then
it also automatically updates the Host: field the same way, because
RFC-2616 uses the same token (host) in both places; therefore sending
the host subcomponent verbatim would be correct behavior.
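
To illustrate what "verbatim" would mean here, the following is a small
sketch that copies the host (and port, if any) from an http: URI into the
Host: field without decoding or converting it.  The function name is
hypothetical, and dropping any userinfo is my own simplification, since
RFC-2616 does not define userinfo for http: URIs.

```python
from urllib.parse import urlsplit


def request_head(uri: str) -> str:
    """Build a minimal HTTP/1.1 request head, copying the host verbatim."""
    parts = urlsplit(uri)
    # Keep host[:port] exactly as written; discard any userinfo before "@".
    host_port = parts.netloc.rpartition("@")[2]
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return f"GET {target} HTTP/1.1\r\nHost: {host_port}\r\n\r\n"


print(request_head("http://b%C3%BCcher.example/index.html"))
```

The percent-encoded name goes on the wire unchanged; whether a server
would recognize it is the interoperability question raised earlier.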

AMC

Received on Sunday, 28 March 2004 17:42:35 UTC