- From: Martin Duerst <duerst@w3.org>
- Date: Thu, 04 Apr 2002 17:14:13 +0900
- To: IETF idn working group <idn@ops.ietf.org>
- Cc: uri@w3.org
Hello Adam,

Sorry for the delay. I'm splitting my answer into two. This one
is about what the IRI spec should say.

At 03:30 02/03/27 +0000, Adam M. Costello wrote:

>The IRI proposal (draft-masinter-url-i18n-08) calls for the host labels
>to be ASCII LDH only, just like in URIs.
>
>When converting an IRI to a URI, you have to convert the path components
>from the local charset to Unicode, then do Unicode normalization, UTF-8
>encoding, and %-escaping. But you don't do anything to the host labels
>because they're already LDH.
>
>I suspect that the reason the IRI proponents don't internationalize the
>host field is that they don't yet have an official IDN spec to point at.
>When they do, I suspect they'll want to revise their proposal so that
>the host field can use the local charset.

This is very close, but it's actually a tiny little bit more complicated.

The IRI proposal is based on defining various important aspects of
IRIs in terms of their mapping to URIs. This way, the IRI proposal
doesn't have to deal with questions such as "what's a resource".
The mapping is completely uniform, based on UTF-8 and %HH.
There are strong reasons for keeping it uniform, because you
can't really look into URIs (and therefore IRIs) in the general case.

So there are three pieces:

1) Extending URIs to use non-ASCII characters (i.e. IRIs)
2) Extending the host name part of URIs to use %HH
   (in some browsers, that already works, in others, it doesn't,
   for ASCII; try e.g. http://www.w%33.org in a few browsers)
3) Extending IRIs to use non-ASCII characters in hostnames

The way the drafts are currently structured, it's:

draft-masinter-url-i18n-08: 1)
draft-ietf-idn-uri-01: 2) and 3)

It would be more straightforward to have it as follows:

draft-ietf-idn-uri: 2)
draft-masinter-url-i18n: 1) and 3)

The main advantage would be that people implementing IRIs find
everything to start in the same place, and don't have to do the
implementation in two stages.

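As a rough illustration of the uniform mapping described above (UTF-8, then %HH), here is a minimal Python sketch. The function name iri_to_uri is hypothetical, and the choice of NFC is an assumption: the text above says only "Unicode normalization". The sketch assumes the input is already a Unicode string (conversion from a local charset is left out), and it escapes every non-ASCII character wherever it occurs, without looking into the structure of the identifier.

```python
import unicodedata


def iri_to_uri(iri: str) -> str:
    """Map an IRI to a URI: normalize, UTF-8 encode, %HH-escape non-ASCII.

    Assumption: NFC is used as the normalization form.
    ASCII characters (including any existing %HH escapes) are left alone.
    """
    normalized = unicodedata.normalize("NFC", iri)
    out = []
    for ch in normalized:
        if ord(ch) < 128:
            # ASCII passes through unchanged.
            out.append(ch)
        else:
            # Each byte of the UTF-8 encoding becomes one %HH triplet.
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)


print(iri_to_uri("http://www.w3.org/People/D\u00fcrst"))
# prints: http://www.w3.org/People/D%C3%BCrst
```

Because the loop never inspects where a character sits, a non-ASCII character in the host part is escaped the same way as one in the path. Whether the resulting URI is acceptable there is exactly what pieces 2) and 3) are about.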
>This raises the question of what characters should be allowed in host
>labels. Since URIs do not allow arbitrary ASCII labels, only host
>labels restricted to LDH characters, one would expect, analogously,
>that IRIs would not allow arbitrary IDNs containing the exotic symbols
>and punctuation allowed by Nameprep, but would allow only host labels
>restricted to a selected set of characters,

There are two questions:

a) What are the allowed host names (see my other mail)
b) What should the URI/IRI specs say about it

The URI/IRI spec should not be the place where the syntax of
allowed host names is defined. The generic URI syntax currently
has a very careful definition of the domain name syntax, but
as far as I am aware, it is just a copy from somewhere else.
And I think it was just done that carefully because some of it
was needed and it was then easier to just do it all.
I'm copying the uri@w3.org list to get information on this point.

With IRIs, whatever the syntax of internationalized host names,
it will be much larger, and too big to fully copy it to the
IRI spec. Also, because new characters will be added to Unicode,
it's a moving target, and it's better to put it just in a single
spec that can then be updated.

draft-masinter-url-i18n-08 currently has a general clause to
make clear that even if the IRI syntax by itself allows quite
a lot,

>>>>
2.3 Mapping of IRIs to URIs

...

This mapping has two purposes:

a) Syntactical: Many URI schemes and components define additional
   syntactical restrictions not captured in Section 2.2. Such
   restrictions can be applied to IRIs by noting that IRIs are only
   valid if they map to syntactically valid URIs. This means that such
   syntactical restrictions do not have to be defined again on the
   IRI level.
>>>>

Of course, if there is something specific to point to for host
names, we would be very glad to actually do that.

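To make the "syntactical" purpose in the quoted clause a bit more concrete, here is a small Python sketch that checks the host part of an already-mapped URI. The function name host_is_ldh and the label pattern are hypothetical. The pattern is a deliberately simplified LDH rule (letters, digits, hyphen, no leading or trailing hyphen), not the full host grammar of the generic URI syntax.

```python
import re
from urllib.parse import urlsplit

# Simplified LDH label: letters, digits, hyphens; no hyphen at either end.
# Assumption: this stands in for the real host grammar, which is stricter.
_LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def host_is_ldh(uri: str) -> bool:
    """Return True if every label of the URI's host is plain LDH."""
    host = urlsplit(uri).hostname
    if not host:
        return False
    return all(_LDH_LABEL.fullmatch(label) for label in host.split("."))


# A non-ASCII path maps to %HH in the path; the host stays LDH.
print(host_is_ldh("http://www.w3.org/People/D%C3%BCrst"))   # True
# A non-ASCII host maps to %HH in the host, which this LDH rule rejects.
print(host_is_ldh("http://www.ex%C3%A4mple.org/"))          # False
```

Under this reading, an IRI with a non-ASCII host label maps to a URI whose host contains %HH, so it is invalid as long as the URI host syntax stays LDH-only. No separate IRI-level rule is needed to say so.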
>Getting back to IRIs and URIs: I propose that conversion of an IRI
>to URI involve applying ToASCII to each host label. This would allow
>conversion of any IRI to a URI without changing the syntax of URIs. In
>contrast, the method proposed in draft-ietf-idn-uri-01 would change the
>URI syntax.

As said in another thread, this is not easy at all (i.e. impossible).
But let's continue the discussion of that aspect in the other thread.

Regards, Martin.

#-#-# Martin J. Du"rst, I18N Activity Lead, World Wide Web Consortium
#-#-# mailto:duerst@w3.org http://www.w3.org/People/D%C3%BCrst

Received on Thursday, 4 April 2002 03:15:03 UTC