Re: Are IDNs allowed in http IRIs? from Martin Duerst on 2004-03-29 (uri@w3.org from March 2004)

From: Martin Duerst <duerst@w3.org>
Date: Mon, 29 Mar 2004 14:43:01 -0500
To: uri@w3.org, public-iri@w3.org
Message-Id: <4.2.0.58.J.20040329114046.05d4df10@localhost>

At 03:03 04/03/29 +0000, Adam M. Costello BOGUS address, see signature wrote:

>I think it may help conceptually to distinguish between generic URIs and
>URIs of a particular scheme.  For example, foo:bar is a valid generic
>URI, but it might not be a valid foo: URI.
>
>Given a foo: IRI, there are two kinds of IRI -> URI conversion we might
>be interested in.  We might merely want to convert to a valid generic
>URI in order to tunnel it through some ASCII infrastructure before being
>ultimately resolved by an IRI resolver, in which case we don't actually
>need a valid foo: URI.  Or we might want to convert to a valid foo: URI
>so that a foo: URI resolver can resolve it.
>
>The former of those two tasks can obviously be done without
>scheme-specific knowledge.  But if you want the URI to be
>resolvable, you can't get the job done without scheme-specific
>knowledge.  Many (most?) browsers today will not resolve
>http://%E7%8C%AB%E3%81%AB%E5%B0%8F%E5%88%A4.nicemice.net/ because
>they don't handle percent-encoding in the host component, because
>it's not valid according to the current http: URI spec.

Opera 7.2 resolves this. My personal version of Amaya
(compiled with the 'IDN' branch of libwww) resolves this.

My expectation is that browsers that do both IRIs (in general)
and IDNs will also resolve the above, either as a side effect of
how the various steps of resolution work, or after really small
changes.


>(Even
>Firefox, which fetches the page when I type the Japanese domain
>label directly in Japanese, cannot resolve the percent-encoded
>UTF-8 form.)  Many (most?) browsers today will not resolve
>mailto:webmaster@%E7%8C%AB%E3%81%AB%E5%B0%8F%E5%88%A4.nicemice.net even
>though they percent-decode the domain, because they (or the MUA they
>invoke) are not expecting UTF-8 in the mail address, because it's not
>valid according to the current mail address spec.

mailto: is a different problem, because it does not use what's
called the 'generic' syntax in RFC 2396.


>If you want URIs that
>are meaningful according to current specs and can be resolved by current
>URI resolvers, you need http://xn--r9j282hvzgc6x.nicemice.net/ and
>mailto:webmaster@xn--r9j282hvzgc6x.nicemice.net.
>
> > Do we need a separate spec for "http:", "mailto:", "ftp:" IRIs, where
> > each specifies the punycode vs. hex-encoding of the various parts?
>
>I don't think so; I think it is sufficient to have separate specs for
>http, mailto, and ftp URIs, which we already have.  The conversion of
>IRIs to URIs can then be defined by general rules in the IRI spec.  For
>example, see the latter part of
>
>http://lists.w3.org/Archives/Public/uri/2004Mar/0049.html
>
>after "redefine the validity of IRIs".
>
>Martin Duerst <duerst@w3.org> wrote:
>
> > At most, there should be a single bit per scheme that says whether
> > punycode should be applied to the 'host' part.
>
>This middle-ground approach is both too much (every IRI-to-URI convertor
>needs to recognize all schemes)

Well, all schemes deployed currently that use DNS names in the
'host' slot of the generic syntax. New schemes can easily
be defined to allow the %-escaping syntax.
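
A minimal sketch of the "single bit per scheme" idea, assuming Python and its built-in IDNA 2003 codec. The table name HOST_USES_PUNYCODE, the function convert_host, and the scheme name "newscheme" are hypothetical and only illustrate the idea; which schemes get the flag is an assumption, not something any spec lists.

```python
from urllib.parse import quote

# Hypothetical per-scheme flag: True means the 'host' slot carries a DNS name
# that existing resolvers expect in punycode form. The entries are illustrative.
HOST_USES_PUNYCODE = {"http": True, "ftp": True}


def convert_host(scheme: str, host: str) -> str:
    """Convert a non-ASCII host to ASCII according to the per-scheme flag."""
    if HOST_USES_PUNYCODE.get(scheme.lower(), False):
        # IDNA ToASCII, applied label by label by Python's "idna" codec.
        return host.encode("idna").decode("ascii")
    # Schemes without the flag fall back to %-escaped UTF-8.
    return quote(host, safe="")


print(convert_host("http", "jos\u00e9.example.net"))       # xn--jos-dma.example.net
print(convert_host("newscheme", "jos\u00e9.example.net"))  # jos%C3%A9.example.net
```

A converter built this way needs to know only the flagged schemes; every other scheme takes the default branch.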


>and too little (mailto: is not handled).

As I said above, it's a separate issue. In my opinion, once we have
some good idea of where i18n of mail addresses is going, we will have
to update the mailto: RFC.


>What do you think of recognizing two kinds of IRI-to-URI conversion, a
>scheme-agnostic kind for tunneling through URI infrastructure to IRI
>resolvers, and a fully scheme-aware kind for interoperating with URI
>resolvers (and for defining the meaning of the IRI)?

I think we are actually very close to this. But I don't want to
make this more explicit than necessary.


>In other words, there are three data types:  IRIs are non-ASCII and
>can be resolved by IRI resolvers.  URIs are ASCII and can be resolved
>by legacy URI resolvers.  HRIs (hybrid resource identifiers) are both
>IRIs and generic URIs; they can be resolved by IRI resolvers but not
>by legacy URI resolvers; they can traverse infrastructure that accepts
>generic URIs without needing to resolve them; and they survive relative
>URI processing.  Conversion between IRIs and HRIs (in either direction)
>needs no scheme-specific knowledge, but conversion to URIs does.

I don't think we need to be that explicit. To the extent that there is
a need for 'tunneling' (I haven't seen any explicit examples of this
yet), that will just work out anyway.



>This model recognizes the inescapable fact that in order for something
>like mailto:postmaster@jose'.example.net to be useful, two pieces of
>knowledge need to come together:  IDNA (which is internationalization
>knowledge), and the fact that the thing after the at-sign is a domain
>name (which is mailto knowledge).  The three-type model makes it clear
>that you can either smarten up the IRI layer with scheme knowledge or
>smarten up the URI resolution layer with IRI knowledge, but one or
>the other is required, and therefore the meaning of something like
>mailto:postmaster@jose'.example.net can be well-defined.  I think the
>two-type model lets this requirement slip through the cracks; both
>layers try to limit their knowledge at the same time, but that just
>doesn't work.

In actual implementations, what you may have to do is just to
replace (or beef up) your DNS resolution code with something
that interprets the %-escaped octets as UTF-8 and knows how to
convert the result to punycode.
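
As a rough illustration of such a pre-resolution step, the sketch below assumes Python and its built-in "idna" codec (IDNA 2003). The function name host_to_ace is hypothetical; a real resolver would also need error handling for escapes that are not valid UTF-8.

```python
from urllib.parse import unquote


def host_to_ace(host: str) -> str:
    """Interpret %-escapes in a host as UTF-8, then convert to punycode (ACE) form."""
    # errors="strict" makes malformed UTF-8 raise instead of being replaced silently.
    decoded = unquote(host, encoding="utf-8", errors="strict")
    # The "idna" codec applies ToASCII to each dot-separated label;
    # labels that are already ASCII pass through unchanged.
    return decoded.encode("idna").decode("ascii")


print(host_to_ace("%E7%8C%AB%E3%81%AB%E5%B0%8F%E5%88%A4.nicemice.net"))
# xn--r9j282hvzgc6x.nicemice.net
```

The result can then be handed to an ordinary DNS lookup, which never sees anything but ASCII.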


> > But to a large extent, this is actually an implementation issue.
>
>Before I can begin to plan or assess an implementation, I need to know
>what it intends to implement.  What is http://jose'.example.net/ supposed
>to mean?  The IRI spec says it is supposed to mean the same thing as
>ToURI(http://jose'.example.net/), but that is not deterministic; it is
>both http://jos%C3%A9.example.net/ and http://xn--jos-dma.example.net/,
>and while the latter has a meaning, the former doesn't (at least, not
>with rfc2396bis as it stands, see my previous message).
>
>In order for the IRI http://jose'.example.net/ to have a well-defined
>meaning, so that we know what our implementation is aiming for, we need
>to tweak either the HTTP spec, the URI spec, the IDNA spec, or the IRI
>spec.  Since the URI and IRI specs are the ones in flux at the moment,
>they seem like the prime candidates.

The IRI spec has been tweaked. I hope this is okay.
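
To make the two candidate results discussed above concrete, here is a small sketch, assuming Python and its IDNA 2003 codec. The function names are hypothetical, and the host-aware variant assumes a simple authority (optional userinfo and port, no IPv6 literal); it is not the conversion algorithm of the IRI spec.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def escape_non_ascii(text: str) -> str:
    """Percent-encode each non-ASCII character as UTF-8; leave ASCII untouched."""
    return "".join(c if ord(c) < 128 else quote(c, safe="") for c in text)


def to_uri_generic(iri: str) -> str:
    """Scheme-agnostic conversion: no knowledge of where the host is."""
    return escape_non_ascii(iri)


def to_uri_host_aware(iri: str) -> str:
    """Host-aware conversion: IDNA ToASCII on the host, %-escapes elsewhere."""
    parts = urlsplit(iri)
    # Split the authority by hand so case and port are preserved as written.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    host, colon, port = hostport.partition(":")
    ace_host = host.encode("idna").decode("ascii")
    netloc = userinfo + at + ace_host + colon + port
    rebuilt = urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
    return escape_non_ascii(rebuilt)


iri = "http://jos\u00e9.example.net/"
print(to_uri_generic(iri))     # http://jos%C3%A9.example.net/
print(to_uri_host_aware(iri))  # http://xn--jos-dma.example.net/
```

The two functions return different URIs for the same IRI, which is the non-determinism raised in the quoted message.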


Regards,    Martin.
Received on Monday, 29 March 2004 14:43:36 UTC