Re: Error handling in URIs

Elliotte Harold wrote:

> 1. All numeric character references should be considered to point to 
> Unicode code points.

Done since about RFC 2070.
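
A small sketch of what that means in practice, using Python's standard
html.unescape; the helper name and the sample references are only
illustrative assumptions.  The number in a reference is a Unicode code
point regardless of the document's own encoding, so &#209; is U+00D1
even in a KOI8-R page where the octet 0xD1 would be a Cyrillic letter:

```python
import html

def expand_numeric_reference(ref: str) -> str:
    """Expand a numeric character reference to the character it names.

    Hypothetical helper: the number is read as a Unicode code point,
    never as an octet in the document's own charset.
    """
    return html.unescape(ref)

# Decimal and hexadecimal forms name the same code point.
same = expand_numeric_reference("&#1103;") == expand_numeric_reference("&#x44F;")
```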

> 2. All percent escapes in documents should be considered to refer
> to UTF-8 bytes.

Not true: http://example.org/%C0%80 is a perfectly valid URI, and
its octets are certainly not well-formed UTF-8 (C0 80 is an overlong
form).
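
To check this, here is a minimal sketch with Python's urllib.parse; the
function name is an assumption made for illustration.  It
percent-decodes the path into octets and asks a strict UTF-8 decoder
whether they are well-formed:

```python
from urllib.parse import urlsplit, unquote_to_bytes

def path_octets_are_utf8(uri: str) -> bool:
    """Return True if the percent-decoded path is well-formed UTF-8."""
    raw = unquote_to_bytes(urlsplit(uri).path)
    try:
        # Strict decoding rejects overlong forms such as C0 80.
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

The URI stays syntactically valid in both cases; only the
interpretation of its octets as characters differs.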

> 3. The browser should convert all IRIs to pure URIs using 
> exclusively UTF-8 percent encoding as specified in the IRI spec.

Yes, since about RFC 3987.  The IRI itself can of course use the
encoding of its context, e.g., KOI8-R in a KOI8-R document:

http://hmdmhdfmhdjmzdtjmzdtzktdkztdjz.googlepages.com/IDN-IRI-koi8-r.html
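
A rough sketch of the character part of that mapping, assuming the IRI
arrives as octets in the document's charset.  The function name is
hypothetical, and <ihost> handling is deliberately left out; this only
covers decoding, NFC normalization, and UTF-8 percent-encoding of the
non-ASCII characters:

```python
import unicodedata

def iri_octets_to_uri(iri_octets: bytes, document_charset: str) -> str:
    """Map an IRI found in a legacy-encoded document to URI characters.

    Simplified sketch: decode with the document's charset, normalize to
    NFC, then percent-encode every non-ASCII character as UTF-8 octets.
    ASCII characters, including existing %XX escapes, are left alone.
    """
    text = unicodedata.normalize("NFC", iri_octets.decode(document_charset))
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)
```

So the same IRI yields the same URI whether the page was stored in
KOI8-R or in UTF-8, as long as the charset of the context is known.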

> 4. If this fails because the UTF-8 in step 2 is ill-formed, redo
> step 2 assuming the encoding is ISO-8859-1 and pray.

Nobody uses iso-8859-1 for real; I think you mean windows-1252.
That is for values of "nobody" that exclude RFC 2616 and 2617, among
others, but if you're going to guess, try windows-1252.
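
If somebody insists on guessing, it could look roughly like this.  The
function name and return shape are assumptions; note that Python's
cp1252 codec leaves a few octets undefined, so the sketch replaces
those instead of failing:

```python
from typing import Tuple
from urllib.parse import unquote_to_bytes

def guess_decode(escaped: str) -> Tuple[str, str]:
    """Decode percent-escapes as UTF-8, else guess windows-1252.

    Returns the decoded text and the label of the charset that was
    used.  The fallback is a guess, not something any URI spec says.
    """
    raw = unquote_to_bytes(escaped)
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Octets undefined in cp1252 become U+FFFD rather than raising.
        return raw.decode("cp1252", errors="replace"), "windows-1252"
```

The guess only matters for display or logging; the URI itself should
be passed on with its octets unchanged.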
 
> I'm not sure about step 4. Maybe there's better error handling
> to be done, but steps 1-3 are the only sane approaches to this.

Yes, skip the prayer: URIs have no "default charset"; that is a
historical accident limited to HTTP.

> Any scheme that attempts to replicate existing browser URL-encoding 
> behavior is doomed to failure, and will simply relegate us to ASCII
> only URIs for the foreseeable future.

URIs are "ASCII only" in the same sense that host names are ASCII only,
i.e., they use the proper subset specified in STD 66, with a way to
express any octet in its percent-encoded form.
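
As a quick illustration (the helper name is made up, and the calls are
Python's urllib.parse), every one of the 256 octet values survives a
round trip through an all-ASCII percent-encoded string:

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

def escape_octets(octets: bytes) -> str:
    """Percent-encode arbitrary octets, leaving only unreserved ASCII."""
    return quote_from_bytes(octets, safe="")

all_octets = bytes(range(256))
escaped = escape_octets(all_octets)

# The escaped form is pure ASCII and decodes back to the same octets.
round_trips = escaped.isascii() and unquote_to_bytes(escaped) == all_octets
```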

And IRIs are not limited to UTF-8.  Everything is perfect (ignoring
HTTP again).  "Redefining" URIs is a horror scenario.  Maybe folks
interested in selling new hard- and software like such disasters :-(

Of course an IRI without context needs to be UTF-8; guess and pray
is no recipe.  And for practical purposes it has to be transformed to
a URI as specified in RFC 3987 - preferably by the side that knows
how that works, since the other side might not know, e.g., the
"punycode" fine print for an <ihost>.
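
For the <ihost> part, a minimal sketch using Python's built-in "idna"
codec, which implements the older IDNA 2003 ToASCII rules.  The
function name and the sample host are assumptions for illustration,
not anything from RFC 3987 itself:

```python
def ihost_to_ascii(ihost: str) -> str:
    """Convert an internationalized host name to its ASCII form.

    Each non-ASCII label becomes an "xn--" punycode label; labels that
    are already ASCII pass through unchanged.
    """
    return ihost.encode("idna").decode("ascii")
```

A side that does not know these rules would more likely percent-encode
the host as UTF-8 instead, which is the reason to do the conversion
where the knowledge is.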

 Frank

Received on Saturday, 28 June 2008 13:19:27 UTC