Re: query on iregname conversion from Erik van der Poel on 2009-09-03 (public-iri@w3.org from September 2009)

From: Erik van der Poel <erikv@google.com>
Date: Wed, 2 Sep 2009 18:50:10 -0700
To: Larry Masinter <masinter@adobe.com>
Cc: "Roy T. Fielding" <fielding@gbiv.com>, "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>
Message-ID: <c07a32650909021850k39123a6dkc4028b772f8a9589@mail.gmail.com>

I'm a bit concerned about pct-decoding and then punycode-encoding. The
problem is that the implementation has no way of knowing what the
underlying encoding is. If it looks like UTF-8, then it can certainly
be converted to punycode, but what if it wasn't intended to be UTF-8
and just happened to look like well-formed UTF-8?
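
To make the ambiguity concrete, here is a minimal sketch in Python. The host name "caf%C3%A9.example" is a made-up example, and ISO-8859-1 is just one assumed alternative encoding. The same two bytes decode cleanly under both encodings, so well-formedness alone cannot reveal which one the author meant:

```python
from urllib.parse import unquote_to_bytes

# Hypothetical pct-encoded host name; only the bytes are known, not the charset.
raw = unquote_to_bytes("caf%C3%A9.example")

# Both decodes succeed, so a UTF-8 well-formedness check cannot
# distinguish the two intentions.
as_utf8 = raw.decode("utf-8")         # 'café.example'
as_latin1 = raw.decode("iso-8859-1")  # 'cafÃ©.example'
```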

Then again, maybe there are too few pct-encoded non-UTF-8 domain names
to worry about. Here is the share of all hrefs on the Web whose domain
name falls into each category:

pct-encoded non-UTF-8            0.0000001%
pct-encoded UTF-8 (non-ASCII)    0.000049%
not pct-encoded non-ASCII        0.0043%
punycode                         0.023%

IE8 puts pct-encoded UTF-8 directly into DNS (without pct-decoding);
Firefox 3.5 refuses to look such domain names up; Safari 4 does
something very funky; and Chrome and Opera convert to Punycode.

Given this situation, I wonder if we could consider the following
alternative plans.

(1) If the domain name contains pct-encoded non-ASCII, reject the
entire URI/IRI. (Do something reasonable with pct-encoded ASCII.)

(2) If the domain name contains pct-encoded non-ASCII, pct-decode it
and check for well-formed UTF-8. If it is UTF-8, convert to Punycode.
If not, reject the URI/IRI. (Do something reasonable with pct-encoded
ASCII.)
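
As a rough sketch of the two plans (not a proposal for spec text), something like the following Python would do. The function names plan1 and plan2 are hypothetical, "something reasonable" for pct-encoded ASCII is assumed here to mean simply pct-decoding it, and Python's built-in "idna" codec (IDNA 2003) stands in for the Punycode conversion step:

```python
import re
from urllib.parse import unquote, unquote_to_bytes

PCT = re.compile(r"%([0-9A-Fa-f]{2})")

def plan1(host: str) -> str:
    """Plan (1): reject any pct-encoded non-ASCII octet in the host."""
    for m in PCT.finditer(host):
        if int(m.group(1), 16) >= 0x80:
            raise ValueError("pct-encoded non-ASCII in domain name")
    # Assumption: pct-encoded ASCII is simply decoded.
    return unquote(host)

def plan2(host: str) -> str:
    """Plan (2): pct-decode, require well-formed UTF-8, convert to Punycode."""
    raw = unquote_to_bytes(host)
    try:
        text = raw.decode("utf-8")  # strict: fails on ill-formed UTF-8
        # The "idna" codec applies nameprep and Punycode label by label.
        return text.encode("idna").decode("ascii")
    except UnicodeError as exc:
        raise ValueError("domain name rejected") from exc
```

With this sketch, plan2("b%C3%BCcher.example") yields "xn--bcher-kva.example", while "b%FCcher.example" (a lone ISO-8859-1 byte) is rejected by plan2 and both inputs are rejected by plan1. Note that plan2 still cannot detect the case where non-UTF-8 bytes happen to look like well-formed UTF-8.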

Erik

Received on Thursday, 3 September 2009 01:50:50 UTC