RE: UTF-8 in URIs

A URI is a sequence of ASCII characters, with a subset chosen such that even EBCDIC can represent it. (Yes, EBCDIC was a concern.)

Yes, some URI schemes can define other uses for %xx-encoded octets, but that belongs in the scheme definition and not in generic processing.

An IRI is a sequence of Unicode characters.  It can generally be translated to a URI for processing by encoding the non-ASCII characters in UTF-8 and %xx percent-hex-encoding the resulting octets.
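
As a rough illustration of the general rule, here is a minimal Python sketch. The function name iri_to_uri is hypothetical, and the sketch deliberately ignores anything beyond "UTF-8 encode, then %xx-encode" (no normalization, no per-component handling), so treat it as an illustration under those assumptions and not as a complete implementation of the IRI spec.

```python
def iri_to_uri(iri: str) -> str:
    """Sketch of the general rule: UTF-8 encode each non-ASCII
    character and %xx-encode every resulting octet.
    ASCII characters are passed through untouched."""
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            out.append(ch)  # already a legal ASCII character, keep as-is
        else:
            # one %XX triplet per UTF-8 octet of this character
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)


print(iri_to_uri("http://example.com/r\u00e9sum\u00e9"))
```

Existing %xx sequences in the input are left alone, since "%" is ASCII and passes through unchanged.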

There are a couple of exceptions recommended for browsers (only), but they shouldn't change the general rule:
  Exception 1: For interoperability with IDN, when an IRI has a non-ASCII host name in the authority and the scheme is one
             of a small number of schemes (ftp, http), the IRI might also be translated into a URI where the host name is Punycode-encoded.

This exception is proposed because IDN servers only recognize the Punycode representation of domain names
and not the %xx hex-encoded form.
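
A hedged sketch of Exception 1 in Python follows. The function name host_to_punycode and the IDN_SCHEMES set are hypothetical. It relies on Python's built-in "idna" codec, which implements the older IDNA 2003 rules, and it assumes an authority with no userinfo part. Only the host is converted; any non-ASCII characters elsewhere in the IRI would still need the general UTF-8 rule.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: the "small number of schemes" is taken to be these two.
IDN_SCHEMES = {"http", "ftp"}


def host_to_punycode(iri: str) -> str:
    """Sketch of Exception 1: replace a non-ASCII host name with its
    Punycode ("xn--") form. Assumes the authority has no userinfo."""
    parts = urlsplit(iri)
    if parts.scheme not in IDN_SCHEMES or not parts.hostname:
        return iri  # exception does not apply, leave unchanged
    # Python's "idna" codec applies IDNA 2003 ToASCII label by label
    netloc = parts.hostname.encode("idna").decode("ascii")
    if parts.port is not None:
        netloc += f":{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path,
                       parts.query, parts.fragment))


print(host_to_punycode("http://b\u00fccher.example/katalog"))
```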

   Exception 2: For IRIs embedded (only) in HTML (only) where the scheme is http (only), where the query portion of the IRI has non-ASCII characters
         and the HTML document character set is not ASCII, the translation of the query portion of the IRI (only)
         should proceed by representing the query string in the document charset and then %xx percent-hex-encoding that.

This exception is proposed as an implementation-compatibility measure for browsers, because one
browser was released with a bug, web sites were coded to match the browser's bug, and now
all the browsers need to mimic the broken behavior.
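
To make the difference concrete, here is a small Python sketch of Exception 2. The function name encode_query_in_charset is hypothetical, and the sketch assumes every character of the query can be represented in the document charset (Python raises UnicodeEncodeError otherwise, which is left unhandled here).

```python
def encode_query_in_charset(query: str, doc_charset: str) -> str:
    """Sketch of Exception 2: represent the query string in the HTML
    document's charset, then %xx-encode the non-ASCII octets."""
    out = []
    for ch in query:
        if ord(ch) < 0x80:
            out.append(ch)  # ASCII, including "=" and "&", is kept as-is
        else:
            # encode in the document charset instead of UTF-8
            out.extend(f"%{b:02X}" for b in ch.encode(doc_charset))
    return "".join(out)


# Same query, two different document charsets:
print(encode_query_in_charset("q=caf\u00e9", "iso-8859-1"))
print(encode_query_in_charset("q=caf\u00e9", "utf-8"))
```

The same query yields different octets depending on the charset of the enclosing document, which is why this is confined to the query portion of http IRIs embedded in HTML.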


I see no justification for any additional use of non-UTF-8 encodings in the IRI -> URI translation.

If anyone has any evidence of any implementations using other encodings, could they please describe it?

We want to update the IRI spec under the apps area working group.

Thanks,

Larry
--
http://larry.masinter.net

Received on Wednesday, 22 January 2014 19:40:11 UTC