
Re: [idn] IRIs ought to use internationalized *host* names

From: Martin Duerst <duerst@w3.org>
Date: Thu, 04 Apr 2002 17:14:13 +0900
Message-Id: <4.2.0.58.J.20020329000302.02c68378@localhost>
To: IETF idn working group <idn@ops.ietf.org>
Cc: uri@w3.org

Hello Adam,

Sorry for the delay.
I'm splitting my answer into two. This one is about what
the IRI spec should say.


At 03:30 02/03/27 +0000, Adam M. Costello wrote:

>The IRI proposal (draft-masinter-url-i18n-08) calls for the host labels
>to be ASCII LDH only, just like in URIs.
>
>When converting an IRI to a URI, you have to convert the path components
>from the local charset to Unicode, then do Unicode normalization, UTF-8
>encoding, and %-escaping.  But you don't do anything to the host labels
>because they're already LDH.
>
>I suspect that the reason the IRI proponents don't internationalize the
>host field is that they don't yet have an official IDN spec to point at.
>When they do, I suspect they'll want to revise their proposal so that
>the host field can use the local charset.

This is very close, but it's actually a tiny little bit more
complicated.

The IRI proposal is based on defining various important aspects
of IRIs in terms of their mapping to URIs. This way, the IRI
proposal doesn't have to deal with questions such as "what's a resource".
The mapping is completely uniform, based on UTF-8 and %HH.
There are strong reasons for keeping it uniform, because you
can't really look into URIs (and therefore IRIs) in the general
case.
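
To make the uniform mapping concrete, here is a minimal Python sketch.
The function name iri_to_uri is hypothetical, and the NFC normalization
step is an assumption taken from the description quoted above, not from
the draft text. The code only touches non-ASCII characters and never
looks at which component they appear in.

```python
import unicodedata


def iri_to_uri(iri: str) -> str:
    """Map an IRI to a URI uniformly: UTF-8 encode, then %HH-escape.

    Hypothetical helper; ASCII characters are passed through unchanged.
    """
    # Assumption: normalize to NFC before encoding.
    normalized = unicodedata.normalize("NFC", iri)
    out = []
    for ch in normalized:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            # Each UTF-8 byte of a non-ASCII character becomes one %HH triple.
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)


print(iri_to_uri("http://www.w3.org/People/Dürst"))
# prints http://www.w3.org/People/D%C3%BCrst
```

Because the function never parses the IRI, it works the same way for
any scheme, which is the point of keeping the mapping uniform.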

So there are three pieces:

1) Extending URIs to use non-ASCII characters (i.e. IRIs)

2) Extending the host name part of URIs to use %HH
    (for ASCII, this already works in some browsers but not in
     others; try e.g. http://www.w%33.org in a few browsers)

3) Extending IRIs to use non-ASCII characters in hostnames
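
As an illustration of piece 2), the following Python sketch shows what
a client would have to do before resolving a host name that contains
%HH escapes. The function name unescaped_host is hypothetical, and the
sketch assumes an http-style URI where the host can be located; it is
not a statement of what any particular browser does.

```python
from urllib.parse import urlsplit, unquote


def unescaped_host(uri: str) -> str:
    """Return the host of an http-style URI with %HH escapes resolved.

    Hypothetical helper for illustration only.
    """
    host = urlsplit(uri).hostname or ""
    # unquote() decodes %HH triples; %33 is the ASCII digit "3".
    return unquote(host)


print(unescaped_host("http://www.w%33.org"))
# prints www.w3.org
```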

The way the drafts are currently structured, it's:

draft-masinter-url-i18n-08: 1)

draft-ietf-idn-uri-01: 2) and 3)

It would be more straightforward to have it as follows:

draft-ietf-idn-uri: 2)

draft-masinter-url-i18n: 1) and 3)

The main advantage would be that people implementing IRIs
would find everything they need in one place, and would not
have to do the implementation in two stages.


>This raises the question of what characters should be allowed in host
>labels.  Since URIs do not allow arbitrary ASCII labels, only host
>labels restricted to LDH characters, one would expect, analogously,
>that IRIs would not allow arbitrary IDNs containing the exotic symbols
>and punctuation allowed by Nameprep, but would allow only host labels
>restricted to a selected set of characters,

There are two questions:

a) What are the allowed host names (see my other mail)
b) What should the URI/IRI specs say about it

The URI/IRI spec should not be the place where the syntax of allowed
host names is defined. The generic URI syntax currently has a very
careful definition of the domain name syntax, but as far as I am
aware, it is just a copy from somewhere else. I think it was only
done that carefully because some of it was needed, and it was then
easier to just do it all. I'm copying the uri@w3.org list to
get information on this point.

With IRIs, whatever the syntax of internationalized host names turns
out to be, it will be much larger, and too big to copy in full into
the IRI spec. Also, because new characters will be added to Unicode,
it's a moving target, and it's better to define it in a single spec
that can then be updated.


draft-masinter-url-i18n-08 currently has a general clause to make
clear that even if the IRI syntax by itself allows quite a lot,
the syntactical restrictions that apply to URIs still apply:


 >>>>
2.3 Mapping of IRIs to URIs

...

This mapping has two purposes:

   a) Syntactical: Many URI schemes and components define additional
      syntactical restrictions not captured in Section 2.2. Such
      restrictions can be applied to IRIs by noting that IRIs are only
      valid if they map to syntactically valid URIs. This means that
      such syntactical restrictions do not have to be defined again
      on the IRI level.
 >>>>

Of course, if there is something specific to point to for
host names, we would be very glad to actually do that.
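
A rough Python sketch of how that clause plays out for host names
under the current draft: map the IRI to a URI, then check the result
against the URI host syntax. The names map_to_uri, host_is_ldh and
iri_is_valid are hypothetical, the LDH pattern is a simplified
stand-in for the generic URI grammar, and the sketch assumes an
http-style IRI without userinfo.

```python
import re
from urllib.parse import urlsplit

# Simplified LDH label: letters, digits, hyphens; no leading/trailing hyphen.
LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def map_to_uri(iri: str) -> str:
    """Uniform mapping: non-ASCII characters become UTF-8 %HH triples."""
    return "".join(
        ch if ord(ch) < 0x80
        else "".join(f"%{b:02X}" for b in ch.encode("utf-8"))
        for ch in iri
    )


def host_is_ldh(uri: str) -> bool:
    """Check every host label of an http-style URI against the LDH pattern."""
    host = urlsplit(uri).hostname or ""
    return all(LDH_LABEL.fullmatch(label) for label in host.split("."))


def iri_is_valid(iri: str) -> bool:
    """An IRI is only valid if it maps to a syntactically valid URI."""
    return host_is_ldh(map_to_uri(iri))


print(iri_is_valid("http://www.w3.org/People/Dürst"))  # prints True
print(iri_is_valid("http://bücher.example/"))          # prints False
```

The second IRI fails only because its host maps to a label containing
"%", which the URI host syntax does not allow; the restriction never
had to be restated at the IRI level.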


>Getting back to IRIs and URIs:  I propose that conversion of an IRI
>to URI involve applying ToASCII to each host label.  This would allow
>conversion of any IRI to a URI without changing the syntax of URIs.  In
>contrast, the method proposed in draft-ietf-idn-uri-01 would change the
>URI syntax.

As said in another thread, this is not easy at all (i.e. impossible).
But let's continue the discussion of that aspect in the other thread.
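
For readers who want to see what the proposal amounts to for a single
host name, here is a small Python sketch. It assumes the host has
already been isolated, which is exactly the step that cannot be done
for URIs in the general case. The function name host_to_ascii is
hypothetical, and Python's built-in "idna" codec implements the
ToASCII operation as later published in RFC 3490, which may differ
from the drafts under discussion here.

```python
def host_to_ascii(host: str) -> str:
    """Apply ToASCII to each label of an already-isolated host name.

    Hypothetical helper; relies on Python's built-in "idna" codec.
    """
    return host.encode("idna").decode("ascii")


print(host_to_ascii("bücher.example"))
# prints xn--bcher-kva.example
```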


Regards,   Martin.

#-#-#  Martin J. Du"rst, I18N Activity Lead, World Wide Web Consortium
#-#-#  mailto:duerst@w3.org   http://www.w3.org/People/D%C3%BCrst
Received on Thursday, 4 April 2002 03:15:03 GMT
