W3C home > Mailing lists > Public > uri@w3.org > May 2003

Re: issue idnuri-02: New approach, new text

From: Martin Duerst <duerst@w3.org>
Date: Tue, 06 May 2003 19:15:32 -0400
Message-Id: <4.2.0.58.J.20030506182321.06015d60@localhost>
To: Dan Connolly <connolly@w3.org>, "Roy T. Fielding" <fielding@apache.org>
Cc: Bjoern Hoehrmann <derhoermi@gmx.net>, public-iri@w3.org, uri@w3.org

At 16:59 03/05/06 -0500, Dan Connolly wrote:
>On Tue, 2003-05-06 at 16:49, Roy T. Fielding wrote:
> > > These are reasons to change RFC 2396 in a way that allows %-escapes
> > > in the hostname component (and probably other components). Has this
> > > been considered and refused?
> >
> > Yes, it has been considered and refused.
>
>If I'm following correctly, that's documented at...
>http://www.apache.org/~fielding/uri/rev-2002/issues.html#036-host-escaping

Yes.


> >   The IETF developed IDNA in
> > order to avoid the need for operating system infrastructure to be
> > updated en masse prior to deployment of i18n domains.  URI processors
> > are part of that infrastructure and the rationale for not changing them
> > is the same as that provided for not globally changing the
> > implementations
> > of BIND.

URI processors are clearly part of infrastructure. But the decision
for IDNA wasn't that everything else should use punycode blindly.
The various issues have to be reconsidered at each level.
In general, URI processors are quite a bit closer to actual
user-oriented applications than DNS calls.


> > IRIs have to be processed by applications that accept the burden of
> > full Unicode processing already.  URIs do not.  Adding punycode
> > interpretation to the processing of URIs or gethostbyname simply
> > will not happen because that technology is already deployed.  Thus,
> > in order to make deployment possible, punycode processing moves up
> > a layer and URIs are specified such that it becomes easier for the
> > IRI processor to determine where it is needed.

I think we should be very careful about initial deployment
(for which I agree) and longer-term deployment, for which
I have to disagree.


>Has anybody got an example to show how this works?

For example servers that actually work, see
http://www.w3.org/2003/Talks/0425-duerst-idniri/slide12-0.html.
For each bullet, the first link uses the actual IDN
(internationalized domain name), the second link is the
mangled punycode equivalent. Unless you have a *very* new
browser or a special IDN plugin, only the second ones will work.

If you want to test out how some actual IDN converts to punycode,
I suggest looking at http://josefsson.org/idn.php. Just enter
your IDN (or copy/paste), and you get the punycode back.


>I'm trying to collecte examples/tests, if only to keep
>all this stuff straight in my own mind.

For actual code, see my recent commits to libwww (on the IDN branch),
e.g. http://dev.w3.org/cvsweb/libwww/Library/src/HTDNS.c
This actually compiles and works on my machine (Win2000), but
hasn't been tested further than that. This is based on
idnkit by JPNIC (where I gave a talk 1.5 weeks ago, see
slides above).

Please note that while I agree with Roy that there is an initial
deployment problem, when anybody actually gets to implement IDNs,
it makes sense to implement them (i.e. do the IDN->punycode conversion)
as low as possible. In libwww, I did that just before the call
to 'gethostbyname'. The main other thing I had to add was the Host:
header. But defining the Host: header to be %-escaped, or even to
be in UTF-8, would not cause any deployment problems that I know of
(this stuff is new, we just have to tell people what to do).
The main issue that Roy's approach makes easier are proxies.
We have to assume that clients who want to use IDNs have to
update in some way, and we also can assume that servers serving
IDNs can be asked to do some upgrade. Proxies are in the middle,
and expecting them to upgrade soon is difficult.

Another datapoint showing that actual implementations will tend
to go as low as possible in the stack is idnkit itself. It
actually allows to patch binaries, if the software is e.g.
written according to the posix locale model.

So I expect that in languages such as python and perl, which
are getting better and better at their character encoding model,
the equivalent of 'gethostbyname' will accept IDNs. Initially,
that will be implemented by using an outside idn library, but
later, it will be using the idn code provided by the OS.


Another case that it worth considering is domain names in other
positions that 'authority'. Roy said:
 >>>>>>>>
Schemes that use DNS within components other than authority will have
to provide their own percent-encoding-to-punycode processing, but that's
no big deal because there are no such schemes deployed that actually
use the domain name for DNS access (they simply use it for identification,
which does not require punycode).
 >>>>>>>>

I'm not sure I understand that. A simple HTTP URI, such as the one
sent to the W3C validator, can contain a domain name. People trying
to validate URIs with IDNs won't want to do a punycode conversion in
their head. So the validator will have to be upgraded to accept things
such as
http://validator.w3.org/check?uri=http%3A%2F%2F%E7%B4%8D%E8%B1%86.w3.mag.kei 
o.ac.jp.


Regards,     Martin.
Received on Tuesday, 6 May 2003 19:17:27 GMT

This archive was generated by hypermail 2.2.0+W3C-0.50 : Thursday, 13 January 2011 12:15:31 GMT