Re: Error handling in URIs

On Tue, 24 Jun 2008, Martin Duerst wrote:
> >
> >The first is error handling behaviour for URIs. Browsers are reasonably 
> >consistent in their handling of invalid URI references such as:
> >
> >   http://example.com/hello world/
> >
> >...or:
> >
> >   {{%%xx##
> >
> >...but the URI specification just says that these URI references are 
> >invalid and doesn't really say what to do with them.
> 
> Yes. For the first example above, most browsers probably just convert the 
> space to a %20, but there might be exceptions. I think this is indeed 
> the job of the HTML spec to specify what is expected, either an error or 
> some other behavior. It's unreasonable to expect that the URI or IRI 
> specs could do that on behalf of the HTML spec or some other specs, 
> because there may be technologies that want to be rather permissive, and 
> others that want to be very strict.

Fair enough.


> For the second case, that's total garbage, and shouldn't resolve. The 
> URI/IRI specs would already tell you that (illegal would mean that it 
> doesn't resolve, wouldn't it?), and I just hope HTML5 works things out 
> so that you don't suddenly come up with something that would work for 
> that case.

Most HTML UAs do in fact pretty much interoperate on the second one (they 
end up requesting "/%7B%7B%%xx" or "/{{%%xx" from the server over HTTP), 
so it seems likely that at least some content depends on this behaviour. 
Right now HTML5 says to fetch "/%7B%7B%25%25xx", though that might have to 
change to fetch "/%7B%7B%%xx" depending on what existing content relies on.
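
To make the two candidate behaviours concrete, here is a rough Python sketch. 
It is only an illustration, not what HTML5 or any UA literally does: the 
function name request_path and its flag are made up for this example, and it 
assumes the reference is resolved against a base URL at the root of the 
site, so the path simply gains a leading "/".

```python
import re
from urllib.parse import quote

def request_path(reference: str, escape_stray_percent: bool) -> str:
    """Return the path a UA might request for an invalid URI reference.

    Hypothetical helper: assumes a base URL at the site root.
    """
    # Everything from the first "#" onwards is the fragment and is
    # never sent to the server.
    path, _, _ = reference.partition("#")
    if escape_stray_percent:
        # Escape any "%" that does not start a valid %HH escape.
        path = re.sub(r"%(?![0-9A-Fa-f]{2})", "%25", path)
    # Percent-encode everything else that is not allowed as-is, leaving
    # "/" and "%" untouched.
    return "/" + quote(path, safe="/%")

print(request_path("{{%%xx##", escape_stray_percent=True))
print(request_path("{{%%xx##", escape_stray_percent=False))
print(request_path("hello world/", escape_stray_percent=True))
```

With the flag set, the second example comes out as "/%7B%7B%25%25xx"; with 
it cleared, as "/%7B%7B%%xx". The first example becomes "/hello%20world/" 
either way.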


> >The second is with IRIs and character encodings other than UTF-8. While 
> >browsers reliably encode non-ASCII characters in the path using UTF-8, 
> >non-ASCII characters in the query component are encoded using the 
> >document's character encoding, and not UTF-8, which is incompatible 
> >with how the IRI spec defines things.
>
> This is indeed a point that's missing from the current IRI spec, and 
> that I plan to add (the draft just expired a few days ago, and I hope to 
> put out a new one maybe over next weekend).
> 
> At the moment, it seems clear to me that IRIs put into the address field 
> should be handled as all-UTF-8 (because there is no encoding context 
> around), and that data coming from form fields should be handled as 
> being in the page encoding (in that case, it can be thought of as being 
> converted to %HH before it even gets put together, so it might never 
> actually be an IRI), but the case of the query part in IRIs inside a 
> document is indeed a tough one: on the one hand, many things work best 
> if you use page encoding for that query part; on the other hand, once 
> you take that IRI out of the page, you lose the encoding context.
> 
> I think the most consistent thing to do would be to say: Either use %HH 
> encoding or accept the fact that the browser will send it back to you as 
> UTF-8. This way, IRIs with query parts would be treated like everything 
> else.

Unfortunately there is definitely content that relies on this as UAs 
implement it now. It seems that all we can say is to encode the query 
component using the document's current encoding, and the path using UTF-8.
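
As a minimal sketch of that split, assuming the path and query have already 
been separated: the helper name encode_iri and the sample values are 
hypothetical, and the sketch does not cover characters that the document 
encoding cannot represent.

```python
from urllib.parse import quote

def encode_iri(path: str, query: str, document_encoding: str) -> str:
    """Percent-encode an IRI's path and query the way UAs do today.

    Hypothetical helper: the path always uses UTF-8, the query uses
    the encoding of the document the IRI was found in.
    """
    # Existing %HH escapes and the usual delimiters are left alone.
    encoded_path = quote(path, safe="/%", encoding="utf-8")
    encoded_query = quote(query, safe="=&%", encoding=document_encoding)
    return encoded_path + "?" + encoded_query

# Same IRI, found in an ISO-8859-1 page and then in a UTF-8 page.
print(encode_iri("/caf\u00e9", "q=caf\u00e9", "iso-8859-1"))
print(encode_iri("/caf\u00e9", "q=caf\u00e9", "utf-8"))
```

In the ISO-8859-1 page the result is "/caf%C3%A9?q=caf%E9"; in the UTF-8 
page it is "/caf%C3%A9?q=caf%C3%A9". So the same IRI yields different 
requests depending on the page it was taken from.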


> >Is there any chance that the URI and IRI specifications might get 
> >updated to handle these issues?
> 
> Yes for the IRI spec. Most probably no for the URI spec.

Cool. Is there anything I can provide you with to help move some of the 
stuff currently at:

   http://www.whatwg.org/specs/web-apps/current-work/#urls

...into the IRI spec?

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Wednesday, 25 June 2008 13:48:12 UTC