RE: Are IDNs allowed in http IRIs? from Martin Duerst on 2004-03-22 (public-iri@w3.org from March 2004)

From: Martin Duerst <duerst@w3.org>
Date: Mon, 22 Mar 2004 11:59:43 -0500
To: Larry Masinter <LMM@acm.org>, "'Michel Suignard'" <michelsu@windows.microsoft.com>, public-iri@w3.org, uri@w3.org
Message-Id: <4.2.0.58.J.20040322114133.05e5ce78@localhost>

At 09:20 04/03/19 -0800, Larry Masinter wrote:

>It seems like a pretty big change to the IRI concept to have
>IRI -> URI transformations use scheme-specific knowledge.
>Formerly, IRI -> URI transformation was specified as scheme
>independent.

Yes. I definitely want to keep it that way as much as possible.
At most, there should be a single bit per scheme that says
whether punycode should be applied to the 'host' part. But
hopefully, even that should be transitory. In no way do I
want this to be a slippery slope.
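
A minimal sketch of what such a single bit per scheme could look like. The table PUNYCODE_HOST_SCHEMES and the function convert_host are hypothetical names used only for illustration; neither the flag table nor its contents come from the IRI draft. Python's built-in "idna" codec (IDNA 2003) stands in for the Punycode step.

```python
from urllib.parse import quote

# Hypothetical flag table: schemes whose 'host' part gets Punycode.
# The single entry is illustrative, not taken from any specification.
PUNYCODE_HOST_SCHEMES = {"http"}


def convert_host(scheme: str, host: str) -> str:
    """Map a possibly non-ASCII host to an ASCII form for use in a URI."""
    if scheme.lower() in PUNYCODE_HOST_SCHEMES:
        # The one scheme-specific bit: apply IDNA ToASCII label by label.
        return host.encode("idna").decode("ascii")
    # Generic, scheme-independent path: percent-encode the UTF-8 bytes.
    return quote(host, safe="")
```

Everything outside the table stays scheme independent, so dropping the table later would leave only the generic path.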

>I understand that this seems necessary because of IDN, but
>it's a big concern.
>
>And to use "SHOULD" would leave us subject to
>the indeterminate knowledge of which method is going to
>be used.

Yes. But to a large extent, this is actually an implementation
issue. For all those cases where IRIs are resolved directly,
it just boils down to whether you implement the IDN->Punycode
conversion before passing 'host' to your DNS resolver code,
or whether you beef up the code that calls the DNS resolver
with IDN. That should be left to the application.
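
As a rough illustration of the first option (converting before the resolver is called), here is a sketch assuming Python's built-in "idna" codec. The name resolve_idn and its injectable resolver parameter are hypothetical and exist only so the conversion step can be seen in isolation.

```python
import socket


def resolve_idn(host: str, port: int, resolver=socket.getaddrinfo):
    """Convert an internationalized host name to ASCII, then resolve it."""
    # IDN -> Punycode happens here, so the resolver only ever sees ASCII.
    ascii_host = host.encode("idna").decode("ascii")
    return resolver(ascii_host, port)
```

The second option would move the same conversion inside the resolver layer itself; which side of that call boundary it sits on is the implementation choice left to the application.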

Overall, there are three ways to use IRIs:

1) direct resolution (e.g. in a browser)
2) use just as an identifier (e.g. as a namespace)
3) conversion to URI (e.g. for proxies,...)

The question really only turns up in case 3). In the IRI draft,
everything is described in terms of 3) because this helps a
lot in writing the spec: it avoids copying all the details about
what a URI is, how the generic syntax works, and so on. In
practice, straightforward, full conversion is somewhat less
important (which does not mean that it should not be defined).
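
For case 3), the scheme-independent part of the conversion can be sketched as follows: every non-ASCII character is replaced by the percent-encoded bytes of its UTF-8 form, and everything else is left alone. The function name is hypothetical, and the sketch deliberately omits other steps such as Unicode normalization.

```python
def percent_encode_non_ascii(iri: str) -> str:
    """Percent-encode the UTF-8 bytes of each non-ASCII character."""
    out = []
    for ch in iri:
        if ord(ch) < 128:
            # ASCII, including existing %XX escapes, passes through untouched.
            out.append(ch)
        else:
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)
```

Because no part of the input is parsed, the same function applies to any scheme.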

>If you believe that IRI->URI needs to be scheme specific,
>then is it really tractable to define IRI as a generic concept?
>Do we need a separate spec for "http:", "mailto:", "ftp:" IRIs,
>where each specifies the punycode vs. hex-encoding of the
>various parts?

I very much hope not. As I have documented in the IRI draft,
if you want to try and make everything just 'work', the
scheme is not the limit. Host names can appear as query
parameters in IRIs, and only (if necessary) an update of
the query interface (and script behind it) can actually
let users use IRI functionality in that case. Look for
the 'validator' example in the draft.
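
A sketch of what an update of the script behind such a query interface might involve, assuming the target address arrives in a query parameter. The parameter name "uri" and the function target_host_ascii are assumptions made for illustration and are not taken from the draft.

```python
from urllib.parse import parse_qs, urlsplit


def target_host_ascii(query: str, param: str = "uri") -> str:
    """Extract the host of an IRI passed as a query parameter, in ASCII form."""
    # parse_qs undoes the percent-encoding, yielding the original characters.
    target = parse_qs(query)[param][0]
    host = urlsplit(target).hostname or ""
    # Only the script knows this value is an address, so it converts the host.
    return host.encode("idna").decode("ascii")
```

A generic IRI -> URI conversion sees only an opaque query string here, which is why no per-scheme rule can cover this case.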

>For example, a 'data' IRI might need a separate
>specification for the default MIME type (since text/plain
>defaults US-ASCII).

Maybe yes. For the moment, one would have to write:
data:text/plain;charset=utf-8,SOME%20DATA
(where SOME and DATA would be non-ASCII characters).
But how much is data: actually used just for text?
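
For illustration, a data: URI of this form could be built as below. The helper name text_data_uri is hypothetical; quote from urllib.parse percent-encodes the UTF-8 bytes of the text.

```python
from urllib.parse import quote


def text_data_uri(text: str) -> str:
    """Build a data: URI for plain text with an explicit UTF-8 charset."""
    # safe="" also escapes '/', so only unreserved characters stay literal.
    return "data:text/plain;charset=utf-8," + quote(text, safe="")
```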

>And an IRI URN might also have a
>different mapping, etc.

As mentioned in the IRI draft, URNs are already defined to
use UTF-8 if there are some non-ASCII characters. So they
work together perfectly with IRIs.

Regards,   Martin.

Received on Monday, 22 March 2004 15:36:04 UTC