Re: issue idnuri-02: New approach, new text from Martin Duerst on 2003-05-01 (public-iri@w3.org from May 2003)

From: Martin Duerst <duerst@w3.org>
Date: Thu, 01 May 2003 15:50:25 -0400
To: Bjoern Hoehrmann <derhoermi@gmx.net>
Cc: public-iri@w3.org, uri@w3.org
Message-ID: <1059641618.IAA22192@phantom.w3.org>
Hello Bjoern,

I have copied the URI list, because this is a topic of mutual concern.

At 05:23 03/04/29 +0200, Bjoern Hoehrmann wrote:
>* Martin Duerst wrote:
> >Issue http://www.w3.org/International/iri-edit#idnuri-02 is about
> >whether to use %-escaping or punycode to map the domain name part
> >of an IRI to an URI. This was discussed at the IETF in San Francisco,
> >and the general tendency there seemed to be towards using punycode.
>
>The hostname production rule in RFC 2396 (as updated by RFC 2732)
>does not allow %-escaping, using %-escaping for URI conversion is
>thus not an option, so why was this at all subject to discussion?

Because of a lot of reasons:

- Some user agents actually allow it (try using IE with http://www.w%33.org)

- The 'U' in URI stands (among else) for 'uniform', and so it is very
   natural to use the same convention everywhere.

- IRIs are defined in many specs (XML, XLink, XML Schema,...) without
   the exception for domain names, so there is code out there that
   produces (or will produce, once it receives IDNs) %-escapes.

- URIs are (or were, Roy said he would change that) not limited to
   DNS host names after the //: this field is (was) more general.
   Obviously, forcing other mechanisms outside DNS to use punycode
   would be impossible. The above change is crucial, but I'm still
   not sure that betting URIs on using only DNS forever is a good idea.

- There are other fields in an URI/IRI which cannot be identified by
   generic URI/IRI software as being host names; these have to use
   %-escapes anyway.

- I wrote http://www.ietf.org/internet-drafts/draft-ietf-idn-uri-03.txt
   for the IDN WG. This formally proposes an update to the URI spec.

- The implementation experience I have with Amaya/libwww suggests that
   it is much easier and more natural to use %hh, at least internally.
   Amaya doesn't know about punycode, it uses %hh throughout to talk
   to lower layers. Libwww unescapes back, at a point it knows it
   has a domain name, and then applies punycode. See
   http://dev.w3.org/cvsweb/libwww/Library/src/HTDNS.c?rev=2.32.2.6&content- 
http://dev.w3.org/cvsweb/libwww/Library/src/HTDNS.c?rev=2.32.2.6&content-typ 
e=text/x-cvsweb-markup
   for the actual code. In other words, the URI/IRI code doesn't need
   to know about punycode, only the DNS-specific code needs to know.


The reasons I know against using %-escapes are:
- http proxies won't understand URIs with %-escapes in the host name part.
   (Note that this in not a problem for the Host: field, which could
    be defined just to match, and only needs to be treated by the
    server that does IDNs, so it's not a general update problem. Obviously
    it's also not a problem for direct access, because one way or the
    other, user agents have to understand IDNs if they want to support them.)
- As you mention, syntax restrictions, although it is a long principle in
   URI software design that you shouldn't check more than necessary, to
   allow for extensibility.


Given the long list of pros, now that I have written this up, I'm
actually surprised that I made the change. In the long run, I'm
sure we would be much better of with using %-escapes in domain names.


Regards,   Martin.
Received on Thursday, 1 May 2003 15:50:39 UTC