Re: query on iregname conversion

I'm a bit concerned about pct-decoding and then Punycode-encoding. The
problem is that the implementation has no way of knowing which character
encoding the pct-encoded bytes were originally in. If they look like
UTF-8, they can certainly be converted to Punycode, but what if they
weren't intended to be UTF-8 and just happened to be well-formed UTF-8?
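
To illustrate the ambiguity, here is a minimal Python sketch (the sample
input %C3%A9 is chosen for illustration and does not come from the
measurements in this message). The same two pct-encoded bytes decode
without error both as UTF-8 and as ISO-8859-1, so well-formedness alone
cannot tell us which one the author meant:

```python
from urllib.parse import unquote_to_bytes

# Illustrative input only: two pct-encoded octets.
raw = unquote_to_bytes("%C3%A9")    # the bytes 0xC3 0xA9

# Both decodings succeed; nothing in the bytes says which was intended.
as_utf8 = raw.decode("utf-8")       # one character: U+00E9
as_latin1 = raw.decode("latin-1")   # two characters: U+00C3 U+00A9

print(repr(as_utf8), repr(as_latin1))
```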

Then again, maybe there are too few pct-encoded non-UTF-8 domain names
to worry about. Here are the percentages of all hrefs on the Web whose
domain names fall into each category:

pct-encoded non-UTF-8            0.0000001%
pct-encoded UTF-8 (non-ASCII)    0.000049%
not pct-encoded, non-ASCII       0.0043%
Punycode                         0.023%

IE8 puts pct-encoded UTF-8 directly into DNS (without pct-decoding),
Firefox 3.5 refuses to look such domain names up, Safari 4 does
something very funky, and Chrome and Opera convert to Punycode.

Given this situation, I wonder if we could consider the following
alternative plans.

(1) If the domain name contains pct-encoded non-ASCII, reject the
entire URI/IRI. (Do something reasonable with pct-encoded ASCII.)

(2) If the domain name contains pct-encoded non-ASCII, pct-decode it
and check for well-formed UTF-8. If it is UTF-8, convert to Punycode.
If not, reject the URI/IRI. (Do something reasonable with pct-encoded
ASCII.)
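
The two plans above can be sketched in Python roughly as follows. This
is only a sketch under stated assumptions, not a proposed
implementation: the names RejectedHost, plan1_reject and
plan2_decode_or_reject are hypothetical, "do something reasonable with
pct-encoded ASCII" is assumed here to mean simply pct-decoding it, and
the conversion uses Python's built-in "idna" codec (IDNA 2003), which
may differ from what a browser would do.

```python
import re
from urllib.parse import unquote, unquote_to_bytes


class RejectedHost(ValueError):
    """Hypothetical error type: the whole URI/IRI should be rejected."""


# Matches a pct-encoded octet in the range %80..%FF, i.e. non-ASCII.
_PCT_NON_ASCII = re.compile(r"%[89A-Fa-f][0-9A-Fa-f]")


def plan1_reject(host: str) -> str:
    """Plan (1): reject any host containing pct-encoded non-ASCII."""
    if _PCT_NON_ASCII.search(host):
        raise RejectedHost("pct-encoded non-ASCII in domain name")
    # Assumption: the "reasonable" handling of pct-encoded ASCII is to decode it.
    return unquote(host)


def plan2_decode_or_reject(host: str) -> str:
    """Plan (2): pct-decode, require well-formed UTF-8, convert to Punycode."""
    raw = unquote_to_bytes(host)
    try:
        text = raw.decode("utf-8")  # strict: fails on ill-formed UTF-8
    except UnicodeDecodeError as exc:
        raise RejectedHost("pct-encoded bytes are not UTF-8") from exc
    try:
        # The built-in "idna" codec Punycode-encodes each non-ASCII label.
        return text.encode("idna").decode("ascii")
    except UnicodeError as exc:
        raise RejectedHost("domain name cannot be converted") from exc
```

For example, under plan (2) the host b%C3%BCcher.example would become
xn--bcher-kva.example, while b%FCcher.example (the same letter in
ISO-8859-1) would be rejected. Under plan (1) both would be rejected.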

Erik

Received on Thursday, 3 September 2009 01:50:50 UTC