Re: Error handling in URIs

John Cowan wrote:

> True but irrelevant: one of the purposes of the HTML5 effort is
> to document and standardize behavior which is neither.
> That means beginning where you are, not where you hypothetically
> ought to be.
> 

Unfortunately the behavior is sometimes irreconcilably inconsistent 
between existing browsers, and even more often inconsistent with common 
sense. This may be one of those cases, since browser behavior was largely 
laid down with an ASCII-only, pre-Unicode mindset.

Bottom line:

1. All numeric character references should be considered to point to 
Unicode code points.
2. All percent escapes in documents should be considered to refer to 
UTF-8 bytes.
3. The browser should convert all IRIs to pure URIs using exclusively 
UTF-8 percent encoding as specified in the IRI spec.
4. If this fails because the UTF-8 in step 2 is ill-formed, redo step 2 
assuming the encoding is ISO-8859-1 and pray.
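
For concreteness, steps 2-4 might be sketched in Python roughly as follows. 
This is only an illustration under stated assumptions: the function names 
iri_to_uri and decode_escapes are made up for this example, the example.org 
URL is a placeholder, and the sketch escapes every non-ASCII character 
uniformly, ignoring any special treatment of internationalized host names.

```python
from urllib.parse import quote, unquote_to_bytes


def iri_to_uri(iri: str) -> str:
    """Step 3 (sketch): percent-encode each non-ASCII character as UTF-8.

    ASCII characters, including any percent escapes already present,
    are passed through unchanged.
    """
    return "".join(
        ch if ord(ch) < 128 else quote(ch, safe="", encoding="utf-8")
        for ch in iri
    )


def decode_escapes(uri: str) -> str:
    """Steps 2 and 4 (sketch): read percent escapes as UTF-8 bytes.

    If the resulting byte sequence is not well-formed UTF-8, fall back
    to ISO-8859-1, which accepts any byte sequence (the "and pray" part).
    """
    raw = unquote_to_bytes(uri)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")


if __name__ == "__main__":
    print(iri_to_uri("http://example.org/caf\u00e9"))
    print(decode_escapes("caf%C3%A9"))  # well-formed UTF-8 escapes
    print(decode_escapes("caf%E9"))     # ill-formed UTF-8, ISO-8859-1 fallback
```

Note that the fallback in decode_escapes can never fail, since every byte 
is a valid ISO-8859-1 character; whether the result is what the author 
meant is another matter.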

I'm not sure about step 4. Maybe there's better error handling to be 
done, but steps 1-3 are the only sane approach to this. (Not that I'm 
convinced the HTML 5 effort is, in fact, sane, but one lives in hope.) 
Any scheme that attempts to replicate existing browser URL-encoding 
behavior is doomed to failure, and will simply relegate us to ASCII-only 
URIs for the foreseeable future.

Absent an encoding declaration, there's just no alternative to 
specifying a single uniform encoding for all URIs. Unless we're sticking 
with ASCII or 8859-1 (which clearly we shouldn't), that encoding is 
going to be UTF-8.

-- 
Elliotte Rusty Harold  elharo@metalab.unc.edu
Refactoring HTML Just Published!
http://www.amazon.com/exec/obidos/ISBN=0321503635/ref=nosim/cafeaulaitA

Received on Friday, 27 June 2008 15:04:03 UTC