Re: Are IDNs allowed in http IRIs? (Issue IDNhttp-19) from Martin Duerst on 2004-03-29 (public-iri@w3.org from March 2004)

From: Martin Duerst <duerst@w3.org>
Date: Mon, 29 Mar 2004 14:38:09 -0500
To: public-iri@w3.org, uri@w3.org, public-iri@w3.org, uri@w3.org
Cc: "Roy T. Fielding" <fielding@gbiv.com>
Message-Id: <4.2.0.58.J.20040329120034.05a93b70@localhost>
Hello Adam,

I have assigned this issue IDNhttp-19, which I have tentatively closed
after the edits I made today (described below).


At 03:42 04/03/19 +0000, Adam M. Costello BOGUS address, see signature wrote:

>Are IDNs allowed in http IRIs?  It seems like a silly question, but
>currently the IRI draft tries to inherit the answer from the URI & HTTP
>specs, and the URI draft tries to inherit the answer from the HTTP spec,
>and of course the HTTP spec knows nothing of IDNs.  The result is that
>the question has no clear answer.  Does it need a clear answer, and if
>so, which document is the place to provide that answer?
>
>Consider the purported IRI http://jose'.example.net/.  Is that a valid
>http IRI?  If so, what does it mean?
>
>There are four places we might hope to find clues to answer those
>questions:  The IRI spec, the URI spec, the HTTP spec, and the IDNA
>spec.
>
>Here's what the IRI spec has to say on the matter:
>
>     IRIs are only valid if they map to syntactically valid URIs.

[I have looked at that paragraph, and have simplified it to eliminate
the word 'valid', because although it is used in RFC2396bis, it is
not used with respect to scheme-specific rules.

a) Syntactical: Many URI schemes and components define additional
    syntactical restrictions not captured in Section 2.2.
    Scheme-specific restrictions are applied to IRIs by converting
    IRIs to URIs and checking the URIs against the scheme-specific
    restrictions.]

>     the resource that the IRI locates is the same as the one located by
>     the URI obtained after converting the IRI

I added the following paragraph to fix the formal problems:

An IRI with a scheme that is known to use domain names in ireg-name,
but where the scheme definition does not allow %-escaping for ireg-name,
meets scheme-specific restrictions if either the straightforward
conversion or the conversion using the ToASCII operation on ireg-name
result in an URI that meets the scheme-specific restrictions.
An IRI with a scheme that is known to use domain names in ireg-name,
but where the scheme definition does not allow %-escaping for ireg-name,
resolves to the URI obtained after converting the IRI including using
the ToASCII operation on ireg-name. Implementations do not need to
do this conversion as long as they produce the same result.

Please note the extended text in the second line:
"but where the scheme definition does not allow %-escaping for ireg-name"
to be open to future/upgraded schemes that allow %-escaping in
reg-name/ireg-name.


>The only way for the URI http://jos%C3%A9.example.net/ to become valid
>and meaningful is for the HTTP URI spec to be updated, either by a
>successor to RFC-2616, or by a sweeping action of the successor to
>RFC-2396.

Well, RFC2396bis essentially does allow %-escaping, but it's not
worded in the 'sweeping action' way that you propose.


>For example, 2396bis could say that it updates all schemes
>that use domain names to use IDNs.  Such a sweeping update would of
>course explicitly bypass the interoperability protections designed into
>the IDNA architecture, and would let things break during the transition
>(it amounts to "just send UTF-8"), but I suppose it's an option.

I think IDNA was right to be very careful about interoperability
protections. But it did go a bit too far in that it tried to
strictly enforce exactly the same interoperability protection/
upgrade path model on all users of domain names. As you admit,
in different contexts, different ways to go about upgrading
may be more appropriate.


>The current 2396bis draft does not make such an update.  It says:
>
>     When a non-ASCII host name represents an internationalized domain
>     name intended for resolution via DNS, the name must be transformed
>     to the IDNA encoding [RFC3490] prior to name lookup.
>
>It does not tell us when a non-ASCII host name represents an IDN, so
>presumably that determination is left up to the scheme.

The assumption is that if something appears in a place where
the scheme uses DNS (that's what the 'intended for resolution via
DNS" is for), but it contains non-ASCII characters, then
it's an IDN. But it might help to make this more explicit.


>Existing
>schemes like http don't use non-ASCII names to represent IDNs (the IDNA
>spec makes that clear).
>
>Getting back to the original questions regarding the purported IRI
>http://jose'.example.net/, we have not been able to determine whether it
>is valid or what it means, because the answers change depending on which
>conversion to to URI we happen to use, and the IRI spec blesses both
>conversions.

I hope this is clarified now.


>Of course the same indeterminacy exists for ftp://jose'.example.net/

This is also fixed.


>and mailto:postmaster@jose'.example.net etc.

which, as I said, is an exception.

>Is this indeterminacy intended?
>
>If not, I can think of two ways to rectify the situation (and maybe
>other folks can think of other ways).  One way is for the URI spec to
>update all URI schemes along these lines:

I think updating the URI spec, or specific scheme specs, is possible,
but I prefer to not have to depend on any of this. So the new wording
in the IRI draft is intended to clarify the situation directly. If
schemes get updated, then all the better.


>     All schemes that use data types that use an ASCII-Compatible
>     Encoding (ACE) for internationalization are hereby updated to
>     allow the use of equivalent non-ASCII forms, represented as
>     percent-encoded UTF-8.  An example of such a data type is domain
>     names, see [IDNA].
>
>(Here I've generalized from IDNs to ACEs because there is a possibility
>that email address local parts will also use an ACE.)
>
>Alternatively, the URI spec could be left as-is, and the IRI spec could
>redefine the validity of IRIs along these lines:
>
>     An IRI is valid if it conforms to the syntax of the corresponding
>     URI except for the following two additional freedoms:
>
>        1) Wherever the URI might contain percent-encoded octets
>        representing UTF-8, the IRI may instead use non-ASCII characters
>        whose UTF-8 encoding is those same octets.
>
>        2) Wherever the URI might contain a data type that uses an
>        ASCII-Compatible Encoding (ACE) for internationalization, the IRI
>        may instead use an equivalent non-ASCII form.  An example of such
>        a data type is domain names, see [IDNA].

This is essentially what I have done, but on a smaller scale.
Opening this up to arbitrary ACEs in arbitrary positions would
be rather unworkable. It could go as far as specific query
parameters of specific URIs/IRIs (see the validator example
in the IRI draft).


>Maybe it would be bothersome to bless a conversion procedure that can
>produce invalid output.  If not, disregard the rest of this message.
>Otherwise, the possibility of invalid output could be avoided by
>creating a new "scheme name registration tree" (BCP-35).  For example,
>the "i" tree could be defined as follows:  For any scheme foo, the
>scheme i-foo is just like foo except for the following additional
>freedom:
>
>     Wherever the foo: URI might contain a data type that uses an
>     ASCII-Compatible Encoding (ACE) for internationalization, the
>     i-foo: URI may instead use an equivalent non-ASCII form, represented
>     as percent-encoded UTF-8.  An example of such a data type is domain
>     names, see [IDNA].

I don't think creating additional schemes is a viable solution
for URI internationalization.

Regards,    Martin.


>A scheme-agnostic IRI-to-URI converter can avoid the possibility of
>invalid output by prefixing i- to the front of the output.  When the URI
>eventually makes its way to an agent with scheme-specific knowledge,
>that agent can know whether the prefix can simply be discarded, or if
>ACE conversions need to be performed first.
>
>A URI-to-IRI converter can always discard the i- prefix, even without
>recognizing the scheme.
>
>AMC
>http://www.nicemice.net/amc/
Received on Monday, 29 March 2004 14:39:44 UTC