Re: query on iregname conversion

On Sep 2, 2009, at 6:50 PM, Erik van der Poel wrote:

> I'm a bit concerned about pct-decoding and then punycode-encoding. The
> problem is that the implementation has no way of knowing what the
> underlying encoding is. If it looks like UTF-8, then it can certainly
> be converted to punycode, but what if it wasn't intended to be UTF-8
> and just happened to look like well-formed UTF-8?

Then it isn't a valid name anyway.  It is quite difficult to create
a domain name that uses UTF-8 octets but isn't actually UTF-8.
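
To illustrate the point, here is a minimal sketch of a strict well-formedness check. The function name looks_like_utf8 is made up for this example, and the byte strings are only sample data. A Latin-1 octet sequence rarely survives a strict UTF-8 decode, although a few sequences (the case Erik is worried about) are valid in both encodings.

```python
def looks_like_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as well-formed UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# "cafe" with e-acute encoded as Latin-1: a lone 0xE9 is not valid UTF-8.
print(looks_like_utf8(b"caf\xe9"))      # False
# The same word encoded as UTF-8: 0xC3 0xA9 is a valid two-octet sequence.
print(looks_like_utf8(b"caf\xc3\xa9"))  # True
# The ambiguous case: 0xC3 0xA9 could also have been meant as two
# Latin-1 characters, and the check cannot tell the difference.
print(looks_like_utf8(b"\xc3\xa9"))     # True
```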

> Then again, maybe there are too few pct-encoded non-UTF-8 domain names
> to worry about. Here are the percentages of all hrefs on the Web:

Er, you mean "on Google" ... Google cannot see the entire Web,
and it is a mistake to rely on spider coverage for protocol
decisions.

> pct-encoded non-UTF-8  0.0000001%
> pct-encoded UTF-8 (non-ASCII)  0.000049%
> not pct-encoded non-ASCII  0.0043%
> punycode  0.023%
>
> IE8 puts pct-encoded UTF-8 directly into DNS (without pct-decoding),
> Firefox3.5 refuses to look such domain names up, Safari4 does
> something very funky, and Chrome/Opera convert to Punycode.

Right, I would not expect any significant use of pct-encoded
(or raw non-ASCII) hostnames on the Internet today because they
are known to fail.

The problem is that non-Internet domains are not limited to
ASCII and cannot use IDNA.  For example, consider IRIs that are
minted inside a WINS-based network within a Russian corporation
to access its own intranet web site.  We use the same software
to access those sites as we do the global Internet.

> Given this situation, I wonder if we could consider the following
> alternative plans.
>
> (1) If the domain name contains pct-encoded non-ASCII, reject the
> entire URI/IRI. (Do something reasonable with pct-encoded ASCII.)

That is fine with me, though I'd be surprised if the browsers
were willing to stick to such a decision.
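
A rough sketch of what plan (1) might look like, under my own assumptions: the function name check_host_plan1 is hypothetical, and "something reasonable" for pct-encoded ASCII is taken here to mean simply decoding it, which a real implementation might want to restrict further (for example, for an encoded "." or "/").

```python
import re

# Matches one pct-encoded octet such as %61 or %C3.
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def check_host_plan1(host: str) -> str:
    """Reject pct-encoded non-ASCII; decode pct-encoded ASCII (assumption)."""

    def repl(match: re.Match) -> str:
        value = int(match.group(1), 16)
        if value >= 0x80:
            # Plan (1): any pct-encoded non-ASCII octet rejects the whole name.
            raise ValueError("pct-encoded non-ASCII octet in host: %r" % host)
        return chr(value)

    return _PCT.sub(repl, host)


print(check_host_plan1("ex%61mple.com"))  # example.com
try:
    check_host_plan1("b%C3%BCcher.example")
except ValueError as err:
    print("rejected:", err)
```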

> (2) If the domain name contains pct-encoded non-ASCII, pct-decode it
> and check for well-formed UTF-8. If it is UTF-8, convert to Punycode.
> If not, reject the URI/IRI. (Do something reasonable with pct-encoded
> ASCII.)

Also fine with me.
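
For comparison, plan (2) could be sketched roughly as follows. The function name host_to_ascii is hypothetical, and the sketch leans on Python's built-in "idna" codec, which implements the IDNA 2003 ToASCII operation per label; other implementations may differ in detail.

```python
from urllib.parse import unquote_to_bytes


def host_to_ascii(host: str) -> str:
    """Plan (2): pct-decode, require well-formed UTF-8, convert to Punycode."""
    raw = unquote_to_bytes(host)
    try:
        text = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # Not well-formed UTF-8: reject the URI/IRI.
        raise ValueError("host is not well-formed UTF-8: %r" % host) from None
    # Pure-ASCII labels pass through unchanged; others become xn-- labels.
    return text.encode("idna").decode("ascii")


print(host_to_ascii("b%C3%BCcher.example"))  # xn--bcher-kva.example
print(host_to_ascii("ex%61mple.com"))        # example.com
try:
    host_to_ascii("b%FCcher.example")        # Latin-1 octet, not UTF-8
except ValueError as err:
    print("rejected:", err)
```

Note that this cannot distinguish the ambiguous case Erik describes: octets that were meant as some other encoding but happen to be well-formed UTF-8 will still be converted.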

What about domain names in raw non-ASCII?

....Roy

Received on Thursday, 3 September 2009 02:35:42 UTC