- From: Larry Masinter <masinter@adobe.com>
- Date: Wed, 22 Jan 2014 19:39:29 +0000
- To: Zhong Yu <zhong.j.yu@gmail.com>, Michael Sweet <msweet@apple.com>
- CC: Martin J. Dürst <duerst@it.aoyama.ac.jp>, "Bjoern Hoehrmann" <derhoermi@gmx.net>, Gabriel Montenegro <Gabriel.Montenegro@microsoft.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>, Osama Mazahir <OSAMAM@microsoft.com>, Dave Thaler <dthaler@microsoft.com>, Mike Bishop <Michael.Bishop@microsoft.com>, "Matthew Cox" <macox@microsoft.com>
A URI is a sequence of ASCII characters, with a subset chosen such that even EBCDIC can represent it. (Yes, EBCDIC was a concern.) Yes, some URI schemes can define other uses for %xx-encoded octets, but this belongs in the scheme definition and not in generic processing.

An IRI is a sequence of Unicode characters. It can generally be translated to a URI for processing by encoding it in UTF-8 and %xx percent-hex-encoding the results.

There are a couple of exceptions recommended for browsers (only), but they shouldn't change the general rule:

Exception 1: For interoperability with IDN, an IRI that has a non-ASCII host name in the authority, and whose scheme is one of a small number of schemes (ftp, http), might also be translated into a URI where the host name is Punycode-encoded. This exception is proposed because IDN servers only recognize the Punycode representation of domain names and not the %xx hex-encoded form.

Exception 2: For IRIs embedded (only) in HTML (only), where the scheme is http (only), where the query portion of the IRI has non-ASCII characters, and where the HTML document character set is not ASCII, the translation of the query portion of the IRI (only) should proceed by representing the query string in the document charset and then %xx percent-hex-encoding that. This exception is proposed as an implementation-compatibility measure for browsers, because one browser was released with a bug, web sites were coded to match the browser's bug, and now all the browsers need to mimic the broken behavior.

I see no justification for any additional use of non-UTF-8 encodings in the IRI -> URI translation. If anyone has evidence of any implementations using other encodings, could they please describe it?

We want to update the IRI spec under the apps area working group.

Thanks,

Larry
--
http://larry.masinter.net
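
For illustration only, the general rule and the two browser exceptions described above might be sketched in Python roughly as follows. This is a minimal sketch, not text from any spec: the function names are hypothetical, the host conversion relies on Python's built-in "idna" codec (which implements IDNA 2003), IPv6 literals are not handled, and a query character that the document charset cannot represent simply raises an error here.

```python
from urllib.parse import urlsplit, urlunsplit


def pct_encode_non_ascii(text, charset="utf-8"):
    """Leave ASCII untouched; %xx-encode the octets of every other character."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.extend(f"%{b:02X}" for b in ch.encode(charset))
    return "".join(out)


def iri_to_uri(iri):
    """General rule: encode in UTF-8, then percent-hex-encode."""
    return pct_encode_non_ascii(iri, "utf-8")


def iri_to_uri_idn(iri):
    """Exception 1 (hypothetical helper): Punycode host for http and ftp."""
    parts = urlsplit(iri)
    if parts.scheme not in ("http", "ftp") or parts.hostname is None:
        return iri_to_uri(iri)
    # Keep any userinfo and port; only the host name is converted.
    userinfo, at, _ = parts.netloc.rpartition("@")
    host = parts.hostname.encode("idna").decode("ascii")
    port = "" if parts.port is None else f":{parts.port}"
    netloc = f"{userinfo}{at}{host}{port}"
    rebuilt = urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )
    # Everything outside the host still follows the general rule.
    return iri_to_uri(rebuilt)


def html_http_iri_to_uri(iri, doc_charset):
    """Exception 2 (hypothetical helper): query uses the document charset."""
    parts = urlsplit(iri)
    if parts.scheme != "http":
        return iri_to_uri(iri)
    query = pct_encode_non_ascii(parts.query, doc_charset)
    # The query is now pure ASCII, so the UTF-8 pass leaves it alone.
    return iri_to_uri(urlunsplit(parts._replace(query=query)))
```

With these definitions, iri_to_uri("http://example.com/é?q=é") gives "http://example.com/%C3%A9?q=%C3%A9", while html_http_iri_to_uri with the same IRI and a document charset of "iso-8859-1" gives "http://example.com/%C3%A9?q=%E9": only the query differs, which is the whole of Exception 2.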
Received on Wednesday, 22 January 2014 19:40:11 UTC