Re: Error handling in URIs

On Tue, 24 Jun 2008, Julian Reschke wrote:
> Ian Hickson wrote:
> > > Could you please be more specific? Any URI is an IRI, so a query component
> > > based on an encoding other than UTF-8 is still a legal IRI.
> > 
> > The IRI spec would have the query component always encoded as UTF-8, as I
> > understand it.
> 
> IRIs consist of Unicode characters. UTF-8 only enters the picture when 
> an IRI is converted to a URI.
> 
> If you start with a URI *or* IRI, and then append query parameters, you 
> always have the choice not to use non-ASCII characters, and to decide 
> yourself what character encoding to use before percent-escaping.

Sure. But in this document:

   <!DOCTYPE HTML>
   <title>Test</title>
   <meta charset="ISO-8859-13">
   <a href="results.cgi/&#x017d;?&#x017d;">Link</a>

...what is the link? It's not a URI, as it contains non-ASCII characters. 
It could be an IRI, but compatibility with Web content requires that it 
not be treated per the IRI spec. Safari, for instance, will fetch the 
following URI (assuming the base URL is http://example.com/):

   http://example.com/results.cgi/%C5%BD?%DE

The result is a valid URI, but somewhere we have to define the processing 
that led to the same character appearing twice in one URL under two 
different character encodings: UTF-8 in the path and the document's 
encoding (ISO-8859-13) in the query. Right now, the URI and IRI specs 
don't define this, so I'm defining it in the HTML5 spec. Unfortunately, 
this requires very intimate interaction with the parsing rules of URIs, 
which makes the specs far less orthogonal than I would like.
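
As a rough illustration of that processing (a sketch only, not spec text): 
the helper name resolve_legacy_href is made up for this example, and it 
assumes the href has no fragment and that every character in the query 
can be represented in the document's encoding. The path is 
percent-escaped as UTF-8 and the query as the document's encoding:

```python
from urllib.parse import quote, urljoin


def resolve_legacy_href(href, base, doc_encoding):
    """Hypothetical helper: escape the path as UTF-8 and the query in
    the document's encoding, then resolve against the base URL.

    Assumes no fragment; raises UnicodeEncodeError if a query character
    is not representable in doc_encoding.
    """
    path, sep, query = href.partition("?")
    # Path segments: always UTF-8 before percent-escaping.
    encoded = quote(path, safe="/", encoding="utf-8")
    if sep:
        # Query: the document's character encoding before percent-escaping.
        encoded += "?" + quote(query, safe="=&", encoding=doc_encoding)
    return urljoin(base, encoded)


# U+017D (Z with caron), as written in the href above via &#x017d;
href = "results.cgi/\u017d?\u017d"
print(resolve_legacy_href(href, "http://example.com/", "iso-8859-13"))
# http://example.com/results.cgi/%C5%BD?%DE
```

With "utf-8" as the document encoding, the same call would give %C5%BD in 
both places, which is the only case where this agrees with the IRI spec.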

IE actually sends http://example.com/results.cgi/%C5%BD?* where "*" is 
the ISO-8859-13-encoded 8-bit byte for that character. If you target an 
iframe, IE uses the encoding of the iframe. You can see some of these 
behaviours if you look at these tests:

   http://hixie.ch/tests/adhoc/uri/encoding/
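
To make the IE case concrete, here is a small sketch of the bytes that 
would go on the wire, assuming "*" above means the raw, unescaped byte. 
The function name ie_style_request_target is hypothetical, and this only 
models the one behaviour described here, not IE's actual implementation:

```python
from urllib.parse import quote


def ie_style_request_target(path, query, doc_encoding):
    """Hypothetical model: path percent-escaped as UTF-8, query sent as
    raw bytes in the document's encoding with no percent-escaping."""
    escaped_path = quote(path, safe="/", encoding="utf-8").encode("ascii")
    raw_query = query.encode(doc_encoding)
    return escaped_path + b"?" + raw_query


target = ie_style_request_target("/results.cgi/\u017d", "\u017d", "iso-8859-13")
print(target)
# b'/results.cgi/%C5%BD?\xde'
```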


> Now, that being said, is there anything HTML5 could do so we can get 
> closer to a strict UTF-8 world in the future? Such as allowing servers 
> to serve documents in an encoding other than UTF-8, but still get query 
> parameters to be consistently encoded in UTF-8?

There might be, but I don't see any way to get there at the moment. Any 
suggestions would be very welcome.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Tuesday, 24 June 2008 19:43:43 UTC