- From: Elliotte Harold <elharo@metalab.unc.edu>
- Date: Fri, 27 Jun 2008 08:03:22 -0700
- To: John Cowan <cowan@ccil.org>
- Cc: Frank Ellermann <hmdmhdfmhdjmzdtjmzdtzktdkztdjz@gmail.com>, uri@w3.org
John Cowan wrote:

> True but irrelevant: one of the purposes of the HTML5 effort is
> to document and standardize behavior which is neither.
> That means beginning where you are, not where you hypothetically
> ought to be.

Unfortunately the behavior is sometimes irreconcilably inconsistent between existing browsers, and even more often inconsistent with common sense. This may be one of those cases since browser behavior was largely laid down in an ASCII-only, pre-Unicode mindset.

Bottom line:

1. All numeric character references should be considered to point to Unicode code points.
2. All percent escapes in documents should be considered to refer to UTF-8 bytes.
3. The browser should convert all IRIs to pure URIs using exclusively UTF-8 percent encoding as specified in the IRI spec.
4. If this fails because the UTF-8 in step 2 is ill-formed, redo step 2 assuming the encoding is ISO-8859-1 and pray.

I'm not sure about step 4. Maybe there's better error handling to be done, but steps 1-3 are the only sane approaches to this. (Not that I'm convinced the HTML 5 effort is, in fact, sane, but one lives in hope.)

Any scheme that attempts to replicate existing browser URL-encoding behavior is doomed to failure, and will simply relegate us to ASCII only URIs for the foreseeable future.

Absent an encoding declaration, there's just no alternative to specifying a single uniform encoding for all URIs. Unless we're sticking with ASCII or 8859-1 (which clearly we shouldn't), that encoding is going to be UTF-8.

--
Elliotte Rusty Harold  elharo@metalab.unc.edu
Refactoring HTML Just Published!
http://www.amazon.com/exec/obidos/ISBN=0321503635/ref=nosim/cafeaulaitA
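
As a rough illustration only, the four steps in the message above could be sketched in Python along the following lines. The function name `iri_to_uri`, the helper names, and the choice to apply the ISO-8859-1 fallback to every escape in the reference once any one of them is ill-formed are assumptions made for this sketch, not something the message specifies. It also ignores host-name (IDNA) handling, which the IRI spec treats separately.

```python
import html
import re
from urllib.parse import quote, unquote_to_bytes

# Runs of consecutive percent escapes, e.g. "%C3%A9".
ESCAPES = re.compile(r"(?:%[0-9A-Fa-f]{2})+")

# Characters left untouched in step 3: "%" (existing escapes) plus the
# URI reserved characters. Unreserved characters are always left alone.
SAFE = "%:/?#[]@!$&'()*+,;="


def _escapes_are_utf8(iri: str) -> bool:
    """Step 2 check: does every run of percent escapes decode as UTF-8?"""
    for run in ESCAPES.findall(iri):
        try:
            unquote_to_bytes(run).decode("utf-8")
        except UnicodeDecodeError:
            return False
    return True


def _latin1_run_to_utf8(match: re.Match) -> str:
    """Step 4: reread one run of escapes as ISO-8859-1, re-escape as UTF-8."""
    text = unquote_to_bytes(match.group(0)).decode("iso-8859-1")
    return quote(text, safe="")


def iri_to_uri(raw: str) -> str:
    # Step 1: numeric character references become Unicode code points.
    # (html.unescape follows HTML5 rules, which remap a few references
    # such as &#128; to &#159;; a stricter reading would not.)
    iri = html.unescape(raw)
    # Steps 2 and 4: keep escapes that are well-formed UTF-8; otherwise
    # assume (hypothetically, for the whole reference) ISO-8859-1.
    if not _escapes_are_utf8(iri):
        iri = ESCAPES.sub(_latin1_run_to_utf8, iri)
    # Step 3: percent-encode remaining non-ASCII characters as UTF-8.
    return quote(iri, safe=SAFE)
```

With this sketch, `caf&#233;`, `caf%C3%A9`, and the ill-formed `caf%E9` in a path should all come out as `caf%C3%A9`, which is the single uniform encoding the message argues for.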
Received on Friday, 27 June 2008 15:04:03 UTC