Re: Error handling in URIs from Martin Duerst on 2008-06-24 (public-iri@w3.org from June 2008)

From: Martin Duerst <duerst@it.aoyama.ac.jp>
Date: Tue, 24 Jun 2008 20:01:49 +0900
To: Ian Hickson <ian@hixie.ch>, uri@w3.org
Cc: public-iri@w3.org
Message-Id: <6.0.0.20.2.20080624194840.09856b40@localhost>
At 19:09 08/06/24, Ian Hickson wrote:
>
>
>Hi,
>
>I recently started addressing issues related to URIs in the context of the 
>HTML5 specification. In general I am trying to defer as much as possible 
>to the URI, IRI, IDN, and XML Base specifications, but there are a couple 
>of issues that are left undefined by those specifications which I am 
>having trouble with.
>
>The first is error handling behaviour for URIs. Browsers are reasonably 
>consistent in their handling of invalid URI references such as:
>
>   http://example.com/hello world/
>
>...or:
>
>   {{%%xx##
>
>...but the URI specification just says that these URI references are 
>invalid and doesn't really say what to do with them.

Yes. For the first example above, most browsers probably just
convert the space to a %20, but there might be exceptions.
I think this is indeed the job of the HTML spec to specify
what is expected, either an error or some other behavior.
It's unreasonable to expect that the URI or IRI specs could
do that on behalf of the HTML spec or some other specs, because
there may be technologies that want to be rather permissive,
and others that want to be very strict.
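
To make the permissive/strict distinction concrete, here is a minimal
Python sketch of the two policies for the first example. The function
names are hypothetical, and neither function is taken from any spec or
browser; they only illustrate the two choices a referencing spec such
as HTML could make.

```python
def percent_encode_spaces(ref: str) -> str:
    """Permissive policy (assumed): turn each literal space into %20."""
    # Only spaces are touched; every other character is left as written.
    return ref.replace(" ", "%20")


def reject_spaces(ref: str) -> str:
    """Strict policy (assumed): treat a literal space as an error."""
    if " " in ref:
        raise ValueError("literal space is not allowed in a URI reference")
    return ref


print(percent_encode_spaces("http://example.com/hello world/"))
# http://example.com/hello%20world/
```

Which of the two is right depends on the technology that embeds the
reference, which is why it is hard to fix one answer in the URI or IRI
specs themselves.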

For the second case, that's total garbage, and it shouldn't
resolve. The URI/IRI specs already tell you that much
(invalid would mean that it doesn't resolve, wouldn't it?),
and I just hope HTML5 works things out so that you don't
suddenly come up with something that would work for that
case.
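
As a rough illustration of why the second example is garbage, the
Python sketch below applies three coarse checks based on RFC 3986:
characters outside the URI character set, a "%" not followed by two hex
digits, and more than one "#". The function name is hypothetical, and
this is not a full implementation of the URI grammar, only enough to
show what goes wrong.

```python
import re

# Characters that may appear somewhere in an RFC 3986 URI reference:
# unreserved, gen-delims, sub-delims, plus "%" for percent-escapes.
_URI_CHARS = re.compile(r"[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]*")
# A "%" that is not followed by exactly two hex digits.
_BAD_ESCAPE = re.compile(r"%(?![0-9A-Fa-f]{2})")


def looks_invalid(ref: str) -> bool:
    """Coarse check only; a True result means the reference is invalid."""
    if _URI_CHARS.fullmatch(ref) is None:
        return True  # contains a character such as "{" or a space
    if _BAD_ESCAPE.search(ref):
        return True  # malformed percent-escape such as "%%" or "%xx"
    return ref.count("#") > 1  # at most one fragment delimiter


print(looks_invalid("{{%%xx##"))                           # True
print(looks_invalid("http://example.com/hello%20world/"))  # False
```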


>The second is with IRIs and character encodings other than UTF-8. While 
>browsers reliably encode non-ASCII characters in the path using UTF-8, 
>non-ASCII characters in the query component are encoded using the 
>document's character encoding, and not UTF-8, which is incompatible with 
>how the IRI spec defines things.

This is indeed a point that's missing from the current IRI spec,
and that I plan to add (the draft just expired a few days ago,
and I hope to put out a new one maybe over next weekend).

At the moment, it seems clear to me that IRIs put into the
address field should be handled as all-UTF-8 (because there
is no encoding context around), and that data coming from
form fields should be handled as being in the page encoding
(in that case, it can be thought of as being converted to %HH
before it even gets put together, so it might never actually
be an IRI), but the case of the query part in IRIs inside a
document is indeed a tough one: On the one hand, many things
work best if you use page encoding for that query part; on
the other hand, once you take that IRI out of the page, you
lose the encoding context.

I think the most consistent thing to do would be to say:
Either use %HH encoding or accept the fact that the browser
will send it back to you as UTF-8. This way, IRIs with query
parts would be treated like everything else.
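
The difference is easy to see with a single character. The Python
sketch below percent-encodes the same query value once as UTF-8 and
once in a legacy page encoding; ISO-8859-1 is only an assumed example
of a page encoding, and the helper name is hypothetical.

```python
from urllib.parse import quote


def encode_query_value(value: str, encoding: str = "utf-8") -> str:
    """Percent-encode a query value using the given character encoding."""
    # safe="" so that reserved characters in the value are escaped too.
    return quote(value, safe="", encoding=encoding)


print(encode_query_value("é"))                # %C3%A9  (UTF-8)
print(encode_query_value("é", "iso-8859-1"))  # %E9     (page encoding)
```

A server that expects %E9 will not get it from an IRI that is handled
as all-UTF-8, so an author who needs the page-encoded bytes would have
to write the %HH form directly in the document.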

Any feedback appreciated. I'm crossposting to public-iri@w3.org
so that this discussion gets logged there, too.

>Is there any chance that the URI and IRI specifications might get updated 
>to handle these issues?

Yes for the IRI spec. Most probably no for the URI spec.

Regards,   Martin.

>At the moment, I'm working around these issues by "wrapping" the URI specs 
>with pre- and post- processing steps and by requiring that implementations 
>use slightly different definitions for the ABNF productions, which is 
>rather dubious. You can see this work in progress here:
>
>   http://www.whatwg.org/specs/web-apps/current-work/#urls
>
>(It's woefully incomplete.) It would be much cleaner if instead HTML5 
>could just defer to the URI specs for everything URI-related.
>
>Cheers,
>-- 
>Ian Hickson               U+1047E                )\._.,--....,'``.    fL
>http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
>Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp
Received on Wednesday, 25 June 2008 01:50:51 UTC