- From: Erik van der Poel <erikv@google.com>
- Date: Wed, 2 Sep 2009 21:43:32 -0700
- To: "Roy T. Fielding" <fielding@gbiv.com>
- Cc: Larry Masinter <masinter@adobe.com>, "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>
On Wed, Sep 2, 2009 at 7:31 PM, Roy T. Fielding <fielding@gbiv.com> wrote:
> On Sep 2, 2009, at 6:50 PM, Erik van der Poel wrote:
>> I'm a bit concerned about pct-decoding and then punycode-encoding. The
>> problem is that the implementation has no way of knowing what the
>> underlying encoding is. If it looks like UTF-8, then it can certainly
>> be converted to punycode, but what if it wasn't intended to be UTF-8
>> and just happened to look like well-formed UTF-8?
>
> Then it isn't a valid name anyway. It is quite difficult to create
> a domain name that uses UTF-8 octets but isn't actually UTF-8.

I was referring to encodings other than UTF-8, e.g. ISO-8859-1,
Shift_JIS, etc.

>> Then again, maybe there are too few pct-encoded non-UTF-8 domain names
>> to worry about. Here are the percentages of all hrefs on the Web:
>
> er, you mean "on Google" ... Google cannot see the entire Web
> and it is a mistake to rely on spider coverage for protocol
> decisions.

True. But perhaps it is better to have some info about current usage,
rather than no info.

>> pct-encoded non-UTF-8          0.0000001%
>> pct-encoded UTF-8 (non-ASCII)  0.000049%
>> not pct-encoded non-ASCII      0.0043%
>> punycode                       0.023%
>>
>> IE8 puts pct-encoded UTF-8 directly into DNS (without pct-decoding),
>> Firefox 3.5 refuses to look such domain names up, Safari 4 does
>> something very funky, and Chrome/Opera convert to Punycode.
>
> Right, I would not expect any significant use of pct-encoded
> (or raw non-ASCII) hostnames on the Internet today because they
> are known to fail.

Yes, they are known to fail in some browser versions.

> The problem is that non-Internet domains are not limited to
> ASCII and cannot use IDNA. For example, IRIs that are minted
> inside a WINS-based network within a Russian corporation to
> access its own intranet web site. We use the same software
> to access those sites as we do the global Internet.
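[The ambiguity being discussed can be illustrated with a short sketch
(Python is my choice here, not anything from the thread): the same
pct-decoded octets are simultaneously well-formed UTF-8 and well-formed
ISO-8859-1, so an implementation has no way to prove which encoding the
author intended.]

```python
# The octets behind "%C3%A9", after pct-decoding.
raw = bytes.fromhex("c3a9")

# Interpreted as UTF-8 they are a single character...
assert raw.decode("utf-8") == "\u00e9"             # 'é'

# ...but the same octets are also perfectly valid ISO-8859-1
# (two characters), so "looks like UTF-8" is only a heuristic.
assert raw.decode("iso-8859-1") == "\u00c3\u00a9"  # 'Ã©'
```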
It is a shame that URLs/URIs/IRIs were not designed with multiple name
resolution protocols in mind. For example:

http://example.com:12345/

The "http" tells us to use HTTP. But what is it that tells us to use
WINS instead of DNS? Trying DNS first and then WINS seems like a hack.
How long should the implementation wait for the DNS response? I don't
know what to suggest here...

>> Given this situation, I wonder if we could consider the following
>> alternative plans.
>>
>> (1) If the domain name contains pct-encoded non-ASCII, reject the
>> entire URI/IRI. (Do something reasonable with pct-encoded ASCII.)
>
> That is fine with me, though I'd be surprised if the browsers
> were willing to stick to such a decision.

Frankly, I'd be surprised too.

>> (2) If the domain name contains pct-encoded non-ASCII, pct-decode it
>> and check for well-formed UTF-8. If it is UTF-8, convert to Punycode.
>> If not, reject the URI/IRI. (Do something reasonable with pct-encoded
>> ASCII.)
>
> Also fine with me.
>
> What about domain names in raw non-ASCII?

I believe the browsers are quite aligned here already. MSIE, Firefox,
Safari, Chrome and Opera all convert the entire HTML file to Unicode,
and then convert the domain names to Punycode. I have no idea about
non-Web apps (such as email).

Erik
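[Plan (2) from the message above is straightforward to sketch. Python is
my choice of language; `host_to_punycode` is a hypothetical helper name,
and Python's "idna" codec implements the IDNA 2003 ToASCII operation
with Punycode.]

```python
from urllib.parse import unquote_to_bytes

def host_to_punycode(host: str) -> str:
    """Plan (2): pct-decode the host, require well-formed UTF-8,
    then convert to Punycode via IDNA; otherwise reject the URI/IRI."""
    raw = unquote_to_bytes(host)
    try:
        decoded = raw.decode("utf-8")   # reject anything not well-formed UTF-8
    except UnicodeDecodeError:
        raise ValueError("reject URI/IRI: host is pct-encoded non-UTF-8")
    # Pure-ASCII hosts pass through unchanged; non-ASCII labels
    # become xn-- (Punycode) labels.
    return decoded.encode("idna").decode("ascii")

# A pct-encoded UTF-8 host converts cleanly:
print(host_to_punycode("b%C3%BCcher.example"))  # xn--bcher-kva.example

# "%FC" is ISO-8859-1 'ü', not UTF-8, so it is rejected:
# host_to_punycode("b%FCcher.example")  -> ValueError
```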
Received on Thursday, 3 September 2009 04:44:19 UTC