- From: Erik van der Poel <erikv@google.com>
- Date: Wed, 2 Sep 2009 18:50:10 -0700
- To: Larry Masinter <masinter@adobe.com>
- Cc: "Roy T. Fielding" <fielding@gbiv.com>, "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>
I'm a bit concerned about pct-decoding and then punycode-encoding. The problem is that the implementation has no way of knowing what the underlying encoding is. If it looks like UTF-8, then it can certainly be converted to Punycode, but what if it wasn't intended to be UTF-8 and just happened to look like well-formed UTF-8? Then again, maybe there are too few pct-encoded non-UTF-8 domain names to worry about. Here are the percentages of all hrefs on the Web:

  pct-encoded non-UTF-8          0.0000001%
  pct-encoded UTF-8 (non-ASCII)  0.000049%
  not pct-encoded non-ASCII      0.0043%
  punycode                       0.023%

IE8 puts pct-encoded UTF-8 directly into DNS (without pct-decoding), Firefox 3.5 refuses to look such domain names up, Safari 4 does something very funky, and Chrome/Opera convert to Punycode.

Given this situation, I wonder if we could consider the following alternative plans:

(1) If the domain name contains pct-encoded non-ASCII, reject the entire URI/IRI. (Do something reasonable with pct-encoded ASCII.)

(2) If the domain name contains pct-encoded non-ASCII, pct-decode it and check for well-formed UTF-8. If it is UTF-8, convert to Punycode. If not, reject the URI/IRI. (Do something reasonable with pct-encoded ASCII.)

Erik
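
[For concreteness, a minimal sketch of what plan (2) might look like, using only Python's standard library (urllib.parse and the built-in "idna" codec). The function name host_to_dns and the reject-by-exception behavior are illustrative assumptions, not part of any spec or browser implementation.]

    from urllib.parse import unquote_to_bytes

    def host_to_dns(host: str) -> str:
        raw = unquote_to_bytes(host)       # pct-decode the host to raw bytes
        try:
            decoded = raw.decode('utf-8')  # strict check for well-formed UTF-8
        except UnicodeDecodeError:
            # plan (2): not UTF-8 -> reject the URI/IRI
            raise ValueError('pct-encoded host is not well-formed UTF-8')
        if decoded.isascii():
            return decoded                 # pct-encoded ASCII: just use the decoded form
        # non-ASCII and well-formed UTF-8: convert each label to Punycode (ACE form)
        return decoded.encode('idna').decode('ascii')

    # e.g. host_to_dns('b%C3%BCcher.example') -> 'xn--bcher-kva.example'
    # while host_to_dns('b%FCcher.example') raises, since %FC is Latin-1, not UTF-8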
Received on Thursday, 3 September 2009 01:50:50 UTC