- From: Martin Duerst <duerst@w3.org>
- Date: Thu, 01 May 2003 15:50:25 -0400
- To: Bjoern Hoehrmann <derhoermi@gmx.net>
- Cc: public-iri@w3.org, uri@w3.org
Hello Bjoern,
I have copied the URI list, because this is a topic of mutual concern.
At 05:23 03/04/29 +0200, Bjoern Hoehrmann wrote:
>* Martin Duerst wrote:
> >Issue http://www.w3.org/International/iri-edit#idnuri-02 is about
> >whether to use %-escaping or punycode to map the domain name part
> >of an IRI to an URI. This was discussed at the IETF in San Francisco,
> >and the general tendency there seemed to be towards using punycode.
>
>The hostname production rule in RFC 2396 (as updated by RFC 2732)
>does not allow %-escaping, using %-escaping for URI conversion is
>thus not an option, so why was this at all subject to discussion?
For quite a number of reasons:
- Some user agents actually allow it (try using IE with http://www.w%33.org)
- The 'U' in URI stands (among other things) for 'uniform', so it is very
natural to use the same convention everywhere.
- IRIs are defined in many specs (XML, XLink, XML Schema,...) without
the exception for domain names, so there is code out there that
produces (or will produce, once it receives IDNs) %-escapes.
- URIs are (or were, Roy said he would change that) not limited to
DNS host names after the //: this field is (was) more general.
Obviously, forcing other mechanisms outside DNS to use punycode
would be impossible. The change mentioned above (limiting this field
to DNS host names) is crucial, but I'm still not sure that betting
URIs on using only DNS forever is a good idea.
- There are other fields in a URI/IRI which cannot be identified by
generic URI/IRI software as being host names; these have to use
%-escapes anyway.
- I wrote http://www.ietf.org/internet-drafts/draft-ietf-idn-uri-03.txt
for the IDN WG. This formally proposes an update to the URI spec.
- The implementation experience I have with Amaya/libwww suggests that
it is much easier and more natural to use %hh, at least internally.
Amaya doesn't know about punycode; it uses %hh throughout to talk
to lower layers. Libwww unescapes at the point where it knows it
has a domain name, and then applies punycode. See
http://dev.w3.org/cvsweb/libwww/Library/src/HTDNS.c?rev=2.32.2.6&content-type=text/x-cvsweb-markup
for the actual code. In other words, the URI/IRI code doesn't need
to know about punycode, only the DNS-specific code needs to know.
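
To illustrate the layering described in the last point, here is a minimal
sketch in Python. It is not the libwww code; the function name
host_for_dns is made up for this example, and it assumes that the %hh
escapes carry UTF-8 bytes and that Python's built-in "idna" codec (which
implements the 2003 IDNA rules) is an acceptable stand-in for the
nameprep/punycode step.

```python
from urllib.parse import unquote


def host_for_dns(escaped_host: str) -> str:
    """Hypothetical helper for the DNS-specific layer.

    The generic URI/IRI code passes the host down with %hh escapes;
    only this function needs to know about punycode.
    """
    # Undo the %hh escapes, assuming the escaped bytes are UTF-8.
    unicode_host = unquote(escaped_host, encoding="utf-8", errors="strict")
    # The "idna" codec applies nameprep and punycode label by label;
    # labels that are already plain ASCII pass through unchanged.
    return unicode_host.encode("idna").decode("ascii")


print(host_for_dns("www.w%33.org"))           # www.w3.org
print(host_for_dns("b%C3%BCcher.example"))    # xn--bcher-kva.example
```

The point of the sketch is only that the conversion sits in one place,
right before the lookup; everything above it can treat the host like any
other %-escaped URI component.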
The reasons I know against using %-escapes are:
- HTTP proxies won't understand URIs with %-escapes in the host name part.
(Note that this is not a problem for the Host: field, which could
be defined just to match, and only needs to be treated by the
server that does IDNs, so it's not a general update problem. Obviously
it's also not a problem for direct access, because one way or the
other, user agents have to understand IDNs if they want to support them.)
- As you mention, syntax restrictions, although it is a long-standing
principle in URI software design that you shouldn't check more than
necessary, to allow for extensibility.
Given the long list of pros, now that I have written this up, I'm
actually surprised that I made the change. In the long run, I'm
sure we would be much better off using %-escapes in domain names.
Regards, Martin.
Received on Thursday, 1 May 2003 15:50:39 UTC