
Error handling in URIs

From: Ian Hickson <ian@hixie.ch>
Date: Tue, 24 Jun 2008 10:09:53 +0000 (UTC)
To: uri@w3.org
Message-ID: <Pine.LNX.4.62.0806240956520.13974@hixie.dreamhostps.com>


I recently started addressing issues related to URIs in the context of the 
HTML5 specification. In general I am trying to defer as much as possible 
to the URI, IRI, IDN, and XML Base specifications, but there are a couple 
of issues that are left undefined by those specifications which I am 
having trouble with.

The first is error handling behaviour for URIs. Browsers are reasonably 
consistent in their handling of invalid URI references such as:

   http://example.com/hello world/

...but the URI specification just says that these URI references are 
invalid and doesn't really say what to do with them.
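To make the gap concrete, here is a minimal Python sketch. It assumes a rough character-repertoire check stands in for the full RFC 3986 ABNF, and that recovery means percent-encoding whatever falls outside that repertoire. The names `is_valid_chars` and `lenient_fixup` are hypothetical, and the recovery rule is an illustration, not a description of what any particular browser does.

```python
import re
from urllib.parse import quote

# Characters RFC 3986 allows somewhere in a URI reference:
# unreserved, gen-delims, sub-delims, plus "%" for percent-encoding.
# This is a repertoire check only, not a full ABNF parse.
_ALLOWED = re.compile(r"[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]*")


def is_valid_chars(ref: str) -> bool:
    """Return True if ref uses only characters from the RFC 3986 repertoire."""
    return _ALLOWED.fullmatch(ref) is not None


def lenient_fixup(ref: str) -> str:
    """Hypothetical recovery step: percent-encode (as UTF-8) every character
    outside the repertoire, leaving everything else untouched."""
    return quote(ref, safe="-._~:/?#[]@!$&'()*+,;=%")


ref = "http://example.com/hello world/"
print(is_valid_chars(ref))   # False: the space is not allowed
print(lenient_fixup(ref))    # http://example.com/hello%20world/
```

The URI specification covers the first function (the reference is invalid) but is silent on anything like the second.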

The second is with IRIs and character encodings other than UTF-8. While 
browsers reliably encode non-ASCII characters in the path using UTF-8, 
they encode non-ASCII characters in the query component using the 
document's character encoding, not UTF-8. This is incompatible with the 
IRI spec, which converts such characters to UTF-8 before percent-encoding.
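The mismatch can be sketched in Python as follows. This assumes an ASCII host name and a document encoding that can represent every character in the query. `browser_style_encode` is a hypothetical name, and the function only approximates the behaviour described above.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def browser_style_encode(iri: str, document_encoding: str) -> str:
    """Approximate the behaviour described above: the path is always
    percent-encoded as UTF-8, the query in the document's encoding.
    Assumes an ASCII host; raises UnicodeEncodeError if the query
    contains a character the document encoding cannot represent."""
    parts = urlsplit(iri)
    path = quote(parts.path, safe="/%", encoding="utf-8")
    query = quote(parts.query, safe="=&%", encoding=document_encoding)
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


# "\u00e9" is U+00E9 (e with acute accent).
iri = "http://example.com/caf\u00e9?q=caf\u00e9"

# In a windows-1252 document the path and the query disagree:
print(browser_style_encode(iri, "windows-1252"))
# http://example.com/caf%C3%A9?q=caf%E9

# Only in a UTF-8 document does the result match the IRI spec's mapping:
print(browser_style_encode(iri, "utf-8"))
# http://example.com/caf%C3%A9?q=caf%C3%A9
```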

Is there any chance that the URI and IRI specifications might get updated 
to handle these issues?

At the moment, I'm working around these issues by "wrapping" the URI specs 
with pre- and post-processing steps and by requiring that implementations 
use slightly different definitions for the ABNF productions, which is 
rather dubious. You can see this work in progress here:


(It's woefully incomplete.) It would be much cleaner if instead HTML5 
could just defer to the URI specs for everything URI-related.

Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Tuesday, 24 June 2008 10:10:30 UTC
