Re: Error handling in URIs from Julian Reschke on 2008-06-24 (uri@w3.org from June 2008)

From: Julian Reschke <julian.reschke@gmx.de>
Date: Tue, 24 Jun 2008 21:57:22 +0200
To: Ian Hickson <ian@hixie.ch>
CC: uri@w3.org
Message-ID: <486151A2.8000307@gmx.de>
Ian Hickson wrote:
> On Tue, 24 Jun 2008, Julian Reschke wrote:
>> Ian Hickson wrote:
>>>> Could you please be more specific? Any URI is an IRI, so a query component
>>>> based on an encoding other than UTF-8 still is a legal IRI.
>>> The IRI spec would have the query component always encoded as UTF-8, as I
>>> understand it.
>> IRIs consist of Unicode characters. UTF-8 only enters the picture when 
>> an IRI is converted to a URI.
>>
>> If you start with a URI *or* IRI, and then append query parameters, you 
>> always have the choice not to use non-ASCII characters, and to decide 
>> yourself what character encoding to use before percent-escaping.
> 
> Sure. But in this document:
> 
>    <!DOCTYPE HTML>
>    <title>Test</title>
>    <meta charset="ISO-8859-13">
>    <a href="results.cgi/&#x017d;?&#x017d;">Link</a>
> 
> ...what is the link? It's not a URI, as it contains non-ASCII characters. 

Correct.

> It could be an IRI, but compatibility with Web content requires that it 

It *is* an IRI.

> not be treated per the IRI spec. Safari, for instance, will fetch the 

...not be converted to a URI per the IRI spec...

> following URI (assuming the base URL is http://example.com/):
> 
>    http://example.com/results.cgi/%C5%BD?%DE
> 
> It yields a valid URI, but somewhere we have to define the processing that 
> led to two characters in the same URL being encoded using two different 
> character encodings. Right now, URI and IRI don't define this, so I'm 
> defining it in the HTML5 spec. This is unfortunately requiring very 
> intimate interaction with the parsing rules of URIs, which is far less 
> orthogonality than I would like.
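For illustration, the Safari behaviour described above can be reproduced roughly as follows. This is only a sketch: the function name is made up, the base URL is joined by plain concatenation rather than real relative-reference resolution, and it assumes the href contains at most one "?".

```python
from urllib.parse import quote


def safari_style_href_to_uri(base, href, document_encoding):
    """Hypothetical helper: percent-encode the path as UTF-8 and the
    query in the document's character encoding, as described above."""
    # Split the href into path and query at the first "?".
    path, sep, query = href.partition("?")
    # Path segments: non-ASCII characters are encoded as UTF-8 bytes.
    encoded_path = quote(path, safe="/", encoding="utf-8")
    # Query: non-ASCII characters are encoded in the document encoding.
    encoded_query = quote(query, safe="=&", encoding=document_encoding)
    # Assumption: base ends in "/" and href is a simple relative path.
    return base + encoded_path + sep + encoded_query


uri = safari_style_href_to_uri(
    "http://example.com/", "results.cgi/\u017d?\u017d", "iso-8859-13"
)
print(uri)  # http://example.com/results.cgi/%C5%BD?%DE
```

The same character (U+017D) ends up as %C5%BD in the path and as %DE in the query, which is exactly the mixed-encoding result under discussion.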

You could change the algorithm that produces the IRI in the first place, 
for instance by making the result equivalent to:

  <a href="results.cgi/&#x017d;?&#xde;">

...in which case the standard IRI->URI conversion would yield the 
expected result.

> IE actually sends http://example.com/results.cgi/%C5%BD?* where "*" is 
> the ISO-8859-13-encoded 8-bit byte for that character. If you target an 

Now that suggests to me that there is no interop between IE and Safari, 
and thus whatever you specify *may* break something.

> iframe, IE uses the encoding of the iframe. You can see some of these 
> behaviours if you look at these tests:
> 
>    http://hixie.ch/tests/adhoc/uri/encoding/
> 
> 
>> Now, that being said, is there anything HTML5 could do so we can get 
>> closer to a strict UTF-8 world in the future? Such as allowing servers 
>> to serve documents in an encoding != UTF-8, but still get query 
>> parameters to be consistently encoded in UTF-8?
> 
> There might be, but I don't see any way to get there at the moment. Any 
> suggestions would be very welcome.

A form attribute through which the site can state: "I want 
UTF-8-encoding-then-percent-escaping, no matter what the document 
encoding was"?
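To make the idea concrete, here is a rough sketch of what such an opt-in could mean for form submission. The function and the force_utf8 flag are hypothetical stand-ins for the proposed attribute, not anything that exists in a spec or a browser.

```python
from urllib.parse import urlencode


def encode_form_query(fields, document_encoding, force_utf8=False):
    """Hypothetical: build a query string from form fields.

    Without the opt-in, values are encoded in the document encoding;
    with it, they are always encoded as UTF-8 before percent-escaping.
    """
    encoding = "utf-8" if force_utf8 else document_encoding
    return urlencode(fields, encoding=encoding)


# Same field value, document served as ISO-8859-13.
legacy = encode_form_query({"q": "\u017d"}, "iso-8859-13")
forced = encode_form_query({"q": "\u017d"}, "iso-8859-13", force_utf8=True)
print(legacy)  # q=%DE
print(forced)  # q=%C5%BD
```

With the opt-in set, the server would see UTF-8 in the query regardless of how the page itself was served.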

Or potentially, in a more distant future, some way of specifying URI 
templates (*)?

BR, Julian

(*) Yes, when they are ready...
Received on Tuesday, 24 June 2008 19:58:13 UTC