Re: query on iregname conversion

On Wed, Sep 2, 2009 at 7:31 PM, Roy T. Fielding <fielding@gbiv.com> wrote:
> On Sep 2, 2009, at 6:50 PM, Erik van der Poel wrote:
>> I'm a bit concerned about pct-decoding and then punycode-encoding. The
>> problem is that the implementation has no way of knowing what the
>> underlying encoding is. If it looks like UTF-8, then it can certainly
>> be converted to punycode, but what if it wasn't intended to be UTF-8
>> and just happened to look like well-formed UTF-8?
>
> Then it isn't a valid name anyway.  It is quite difficult to create
> a domain name that uses UTF-8 octets but isn't actually UTF-8.

I was referring to encodings other than UTF-8, e.g. ISO-8859-1, Shift_JIS, etc.
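
To make the ambiguity concrete, here is a small Python sketch. The
function name candidate_readings and the sample host fragment are my own,
purely for illustration. It shows one pct-encoded byte pair that is
well-formed UTF-8 but also decodes cleanly as ISO-8859-1 and Shift_JIS,
so a well-formedness check alone cannot tell us what the author intended.

```python
from urllib.parse import unquote_to_bytes


def candidate_readings(pct_encoded, encodings=("utf-8", "iso-8859-1", "shift_jis")):
    """Pct-decode the input and decode the octets under each candidate charset.

    Returns a dict mapping charset name to the decoded text, or to None
    when the octets are not valid in that charset.
    """
    raw = unquote_to_bytes(pct_encoded)
    readings = {}
    for name in encodings:
        try:
            readings[name] = raw.decode(name)  # strict decoding
        except UnicodeDecodeError:
            readings[name] = None
    return readings


# The octets C3 A9 are "é" in UTF-8, but "Ã©" in ISO-8859-1.
readings = candidate_readings("%C3%A9")
for charset, text in readings.items():
    print(charset, repr(text))
```

All three decodings succeed, and each gives different text.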

>> Then again, maybe there are too few pct-encoded non-UTF-8 domain names
>> to worry about. Here are the percentages of all hrefs on the Web:
>
> er, you mean "on Google" ... Google cannot see the entire Web
> and it is a mistake to rely on spider coverage for protocol
> decisions.

True. But perhaps it is better to have some info about current usage,
rather than no info.

>> pct-encoded non-UTF-8  0.0000001%
>> pct-encoded UTF-8 (non-ASCII)  0.000049%
>> not pct-encoded non-ASCII  0.0043%
>> punycode  0.023%
>>
>> IE8 puts pct-encoded UTF-8 directly into DNS (without pct-decoding),
>> Firefox3.5 refuses to look such domain names up, Safari4 does
>> something very funky, and Chrome/Opera convert to Punycode.
>
> Right, I would not expect any significant use of pct-encoded
> (or raw non-ASCII) hostnames on the Internet today because they
> are known to fail.

Yes, they are known to fail in some browser versions.

> The problem is that non-Internet domains are not limited to
> ASCII and cannot use IDNA.  For example, IRIs that are minted
> inside a WINS-based network within a Russian corporation to
> access its own intranet web site.  We use the same software
> to access those sites as we do the global Internet.

It is a shame that URLs/URIs/IRIs were not designed with multiple name
resolution protocols in mind. For example:

http://example.com:12345/

The "http" tells us to use HTTP. But what is it that tells us to use
WINS instead of DNS? Trying DNS first and then WINS seems like a hack.
How long should the implementation wait for the DNS response?

I don't know what to suggest here...

>> Given this situation, I wonder if we could consider the following
>> alternative plans.
>>
>> (1) If the domain name contains pct-encoded non-ASCII, reject the
>> entire URI/IRI. (Do something reasonable with pct-encoded ASCII.)
>
> That is fine with me, though I'd be surprised if the browsers
> were willing to stick to such a decision.

Frankly, I'd be surprised too.

>> (2) If the domain name contains pct-encoded non-ASCII, pct-decode it
>> and check for well-formed UTF-8. If it is UTF-8, convert to Punycode.
>> If not, reject the URI/IRI. (Do something reasonable with pct-encoded
>> ASCII.)
>
> Also fine with me.
>
> What about domain names in raw non-ASCII?

I believe the browsers are quite aligned here already. MSIE, Firefox,
Safari, Chrome and Opera all convert the entire HTML file to Unicode,
and then convert the domain names to Punycode.
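
For what it's worth, plans (1) and (2) above could be sketched roughly as
follows in Python. This is only an illustration under some assumptions:
the names has_pct_encoded_non_ascii, plan2_host_to_ascii and RejectedHost
are made up, "something reasonable" for pct-encoded ASCII is taken to
mean simply decoding it, and the Punycode step uses Python's built-in
"idna" codec (IDNA 2003), which is not necessarily what any browser does.

```python
import re
from urllib.parse import unquote_to_bytes

# A pct-encoded octet with the high bit set (%80 through %FF).
PCT_NON_ASCII = re.compile(r"%[89A-Fa-f][0-9A-Fa-f]")


class RejectedHost(ValueError):
    """Raised when the host, and hence the whole URI/IRI, is rejected."""


def has_pct_encoded_non_ascii(host):
    """Plan (1) test: True means the URI/IRI would be rejected outright."""
    return PCT_NON_ASCII.search(host) is not None


def plan2_host_to_ascii(host):
    """Plan (2): pct-decode, require well-formed UTF-8, convert to Punycode."""
    raw = unquote_to_bytes(host)
    try:
        text = raw.decode("utf-8")  # strict: rejects ill-formed sequences
    except UnicodeDecodeError as exc:
        raise RejectedHost("pct-decoded host is not well-formed UTF-8") from exc
    try:
        # Assumption: the stdlib "idna" codec stands in for the Punycode step.
        return text.encode("idna").decode("ascii")
    except UnicodeError as exc:
        raise RejectedHost("host cannot be converted to Punycode") from exc


print(plan2_host_to_ascii("b%C3%BCcher.example"))  # xn--bcher-kva.example
```

Note that a host like b%FCcher.example (ISO-8859-1 octets) is rejected
under both plans, while octets in another encoding that happen to form
well-formed UTF-8 would pass plan (2) and be converted.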

I have no idea about non-Web apps (such as email).

Erik

Received on Thursday, 3 September 2009 04:44:19 UTC