- From: Roy T. Fielding <fielding@gbiv.com>
- Date: Wed, 2 Sep 2009 19:31:35 -0700
- To: Erik van der Poel <erikv@google.com>
- Cc: Larry Masinter <masinter@adobe.com>, "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>
On Sep 2, 2009, at 6:50 PM, Erik van der Poel wrote:

> I'm a bit concerned about pct-decoding and then punycode-encoding. The
> problem is that the implementation has no way of knowing what the
> underlying encoding is. If it looks like UTF-8, then it can certainly
> be converted to punycode, but what if it wasn't intended to be UTF-8
> and just happened to look like well-formed UTF-8?

Then it isn't a valid name anyway. It is quite difficult to create
a domain name that uses UTF-8 octets but isn't actually UTF-8.

> Then again, maybe there are too few pct-encoded non-UTF-8 domain names
> to worry about. Here are the percentages of all hrefs on the Web:

er, you mean "on Google" ... Google cannot see the entire Web and
it is a mistake to rely on spider coverage for protocol decisions.

> pct-encoded non-UTF-8            0.0000001%
> pct-encoded UTF-8 (non-ASCII)    0.000049%
> not pct-encoded non-ASCII        0.0043%
> punycode                         0.023%
>
> IE8 puts pct-encoded UTF-8 directly into DNS (without pct-decoding),
> Firefox3.5 refuses to look such domain names up, Safari4 does
> something very funky, and Chrome/Opera convert to Punycode.

Right, I would not expect any significant use of pct-encoded
(or raw non-ASCII) hostnames on the Internet today because they
are known to fail. The problem is that non-Internet domains
are not limited to ASCII and cannot use IDNA. For example, IRIs
that are minted inside a WINS-based network within a Russian
corporation to access its own intranet web site. We use the same
software to access those sites as we do the global Internet.

> Given this situation, I wonder if we could consider the following
> alternative plans.
>
> (1) If the domain name contains pct-encoded non-ASCII, reject the
> entire URI/IRI. (Do something reasonable with pct-encoded ASCII.)

That is fine with me, though I'd be surprised if the browsers
were willing to stick to such a decision.

> (2) If the domain name contains pct-encoded non-ASCII, pct-decode it
> and check for well-formed UTF-8. If it is UTF-8, convert to Punycode.
> If not, reject the URI/IRI. (Do something reasonable with pct-encoded
> ASCII.)

Also fine with me. What about domain names in raw non-ASCII?

....Roy
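
For illustration only, plan (2) above could be sketched roughly as
follows in Python. This is not taken from any browser or from the
thread: the function name normalize_host is hypothetical, the
built-in "idna" codec implements the older IDNA 2003 rules (RFC 3490),
and treating raw non-ASCII input the same way as pct-encoded UTF-8 is
an assumption of this sketch, not something the thread settles.

```python
from urllib.parse import unquote_to_bytes


def normalize_host(host: str) -> str:
    """Hypothetical helper sketching plan (2).

    pct-decode the host, require well-formed UTF-8, then convert
    to Punycode (A-labels). Raises ValueError to signal that the
    whole URI/IRI should be rejected.
    """
    # Undo %XX escapes. Raw non-ASCII characters already present in
    # the string are encoded as UTF-8 by unquote_to_bytes (assumption:
    # this sketch handles them like pct-encoded UTF-8).
    raw = unquote_to_bytes(host)

    # Strict decoding fails on anything that is not well-formed UTF-8.
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError("host is not well-formed UTF-8: %r" % host) from exc

    # Python's built-in codec applies IDNA 2003 ToASCII per label.
    try:
        return text.encode("idna").decode("ascii")
    except UnicodeError as exc:
        raise ValueError("host cannot be converted to Punycode: %r" % host) from exc
```

Under these assumptions, normalize_host("b%C3%BCcher.example") returns
"xn--bcher-kva.example", pct-encoded ASCII such as "ex%61mple.com" is
simply decoded to "example.com", and a Latin-1 escape such as
"b%FCcher.example" is rejected. Plan (1) would differ only in raising
as soon as a pct-encoded octet above 0x7F is seen.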
Received on Thursday, 3 September 2009 02:35:42 UTC