Re: Error handling in URIs

Hello Ian,

At 11:01 08/06/25, Ian Hickson wrote:
>
>On Tue, 24 Jun 2008, Martin Duerst wrote:
>> >
>> >The first is error handling behaviour for URIs. Browsers are reasonably 
>> >consistent in their handling of invalid URI references such as:
>> >
>> >   http://example.com/hello world/
>> >
>> >...or:
>> >
>> >   {{%%xx##
>> >
>> >...but the URI specification just says that these URI references are 
>> >invalid and doesn't really say what to do with them.
>> 
>> Yes. For the first example above, most browsers probably just convert the 
>> space to a %20, but there might be exceptions. I think this is indeed 
>> the job of the HTML spec to specify what is expected, either an error or 
>> some other behavior. It's unreasonable to expect that the URI or IRI 
>> specs could do that on behalf of the HTML spec or some other specs, 
>> because there may be technologies that want to be rather permissive, and 
>> others that want to be very strict.
>
>Fair enough.
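
To make the permissive/strict distinction concrete, here is a minimal
Python sketch. The names ALLOWED, strict and permissive are mine, purely
for illustration, and the check is deliberately crude: it looks only at
individual characters and does not verify that a "%" is followed by two
hex digits.

```python
from urllib.parse import quote

# Characters that RFC 3986 lets appear literally somewhere in a URI
# reference (unreserved, reserved, and "%" as the escape introducer).
ALLOWED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
    "-._~:/?#[]@!$&'()*+,;=%"
)


def strict(ref: str) -> str:
    """A strict technology: reject the reference outright."""
    for ch in ref:
        if ch not in ALLOWED:
            raise ValueError(f"invalid character {ch!r} in {ref!r}")
    return ref


def permissive(ref: str) -> str:
    """A permissive technology: percent-encode whatever is not allowed."""
    return "".join(ch if ch in ALLOWED else quote(ch, safe="") for ch in ref)


print(permissive("http://example.com/hello world/"))
# http://example.com/hello%20world/
```

With the same input, strict() raises ValueError, which is the kind of
choice I think each host spec has to make for itself.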

Well, reading some mails in the thread, I was reminded of
LEIRIs (they are a painful legacy, so it's easy to forget about them :-( ).

During the LEIRI discussion, one of the things I realized was relevant
was what various specs did to URIs/IRIs on top of RFC 3986/7.
So I started a page to put information on:
http://www.w3.org/International/iri-edit/spec-use-survey.html
This is certainly incomplete; any additions are appreciated.

The motivations behind LEIRI were manifold:
- A series of specs (XML, XLink, XML Schema,...) that all
  relied on the same definition of 'IRI' (so it seemed to make
  sense to deal with this in a central place).
- The fact that these specs had this kind of definition came from
  an earlier (draft) version of the IRI spec (so just saying:
  "it's all your fault, go fix it" didn't exactly work).
- The claim that any iota of effective change to the original
  specs would be very, very bad, because this was XML, and
  XML didn't change.  
- An existing internet draft that misleadingly used the term
  "Human Readable Resource Identifiers" (as if something with
  a few C0 characters would be very humanly readable).

The LEIRI section is written in a way that allows other specs to
participate partially (e.g. by saying "we allow spaces and
tabs and newlines in our IRI attributes, but not general C0 crap").
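
As a rough illustration of such partial participation, a spec's
processing could look like the following Python sketch. The names
TOLERATED and to_iri are hypothetical, and the policy shown (escape
space, tab and newline; reject every other C0 character) is just the
example from the sentence above, not text from any spec.

```python
# Hypothetical policy: only these three characters are tolerated
# and get percent-encoded on the way to a proper IRI.
TOLERATED = {" ", "\t", "\n"}


def to_iri(value: str) -> str:
    out = []
    for ch in value:
        if ch in TOLERATED:
            out.append(f"%{ord(ch):02X}")   # e.g. space becomes %20
        elif ord(ch) < 0x20:
            # Any other C0 control character (U+0000..U+001F) is refused.
            raise ValueError(f"control character U+{ord(ch):04X} not allowed")
        else:
            out.append(ch)
    return "".join(out)


print(to_iri("a b\tc"))
# a%20b%09c
```

Under this policy, to_iri("a\x06b") raises ValueError.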

Do you have any idea what browsers do if they get, e.g., a
0x06 byte (ACK)?

>> For the second case, that's total garbage, and shouldn't resolve. The 
>> URI/IRI specs would already tell you that (illegal would mean that it 
>> doesn't resolve, or wouldn't it), and I just hope HTML5 works things out 
>> so that you don't suddenly come up with something that would work for 
>> that case.
>
>Most HTML UAs do in fact pretty much interoperate on the second one (they 
>end up requesting "/%7B%7B%%xx" or "/{{%%xx" from the server over HTTP), 
>so it seems likely that at least some content depends on this behaviour. 
>Right now HTML5 says to fetch "/%7B%7B%25%25xx", though that might have to 
>change to fetch "/%7B%7B%%xx" depending on what content depends on.

Okay, yes, if you take that as relative, then you have somewhere to start.
If the latter is current browser behavior, then I'm sure there are
a few pages out there that actually depend on that. But how many?
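
Just so we are talking about the same thing, here is how I read the two
candidate behaviors, as a small Python sketch. Everything in it is my
assumption for illustration: the function names, the base URL path being
"/", and the simplification that only "{", "}" and "%" get special
treatment. It is not what HTML5 or any browser actually implements.

```python
import re


def escape(path: str, fix_stray_percent: bool) -> str:
    out = []
    for i, ch in enumerate(path):
        if ch == "%":
            # A "%" is a valid escape only if two hex digits follow it.
            valid = re.fullmatch(r"[0-9A-Fa-f]{2}", path[i + 1:i + 3]) is not None
            out.append("%" if valid or not fix_stray_percent else "%25")
        elif ch in "{}":
            out.append(f"%{ord(ch):02X}")
        else:
            out.append(ch)
    return "".join(out)


def request_path(ref: str, fix_stray_percent: bool) -> str:
    path = ref.split("#", 1)[0]      # everything from the first "#" is fragment
    return "/" + escape(path, fix_stray_percent)


print(request_path("{{%%xx##", fix_stray_percent=True))
# /%7B%7B%25%25xx
print(request_path("{{%%xx##", fix_stray_percent=False))
# /%7B%7B%%xx
```

The only difference between the two is whether a "%" that does not start
a valid escape is itself escaped.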

Looking back over the more than 10 years I have been involved in
(RFC 2070) or have been watching the HTML spec work, I very much
agree that only doing idealistic theory doesn't work. On the other
hand, I think it's also dangerous to go ahead and cast in stone any
and all browser quirks (even if they are consistent across the major
contenders). Where do you want to draw the line? I know all browsers
in the past have made changes that made small numbers of pages stop
working.

Related, we are now discussing what a browser has to accept, and
how to process it, but do you separately say what authors should
do and what not? The email folks (RFC 2821/2 (bis)) have had
reasonable success working on cleaning up some of their mess
(in many ways comparable to the HTML mess) by very carefully
creating "MUST accept, but MUST NOT generate" obsolete parts
of their syntax.


>> >The second is with IRIs and character encodings other than UTF-8. While 
>> >browsers reliably encode non-ASCII characters in the path using UTF-8, 
>> >non-ASCII characters in the query component are encoded using the 
>> >document's character encoding, and not UTF-8, which is incompatible 
>> >with how the IRI spec defines things.
>>
>> This is indeed a point that's missing from the current IRI spec, and 
>> that I plan to add (the draft just expired a few days ago, and I hope to 
>> put out a new one maybe over next weekend).
>> 
>> At the moment, it seems clear to me that IRIs put into the address field 
>> should be handled as all-UTF-8 (because there is no encoding context 
>> around), and that data coming from form fields should be handled as 
>> being in the page encoding (in that case, it can be thought to be 
>> converted to %HH before it even gets put together, so it never actually 
>> might be an IRI), but the case of the query part in IRIs inside a 
>> document is indeed a tough one: On the one part, many things work best 
>> if you use page encoding for that query part; on the other hand, once 
>> you take that IRI out of the page, you lose the encoding context.
>> 
>> I think the most consistent thing to do would be to say: Either use %HH 
>> encoding or accept the fact that the browser will send it back to you as 
>> UTF-8. This way, IRIs with query parts would be treated like everything 
>> else.
>
>Unfortunately there is definitely content that relies on this as UAs 
>implement now. It seems that all we can say is to encode the query 
>component using the current encoding, and the path using UTF-8.

Can you tell me what UAs do if you just paste such an IRI into the
address bar? Where do they get the encoding from?

Also, do you realize that a few years ago, many browsers interpreted
the path part in the encoding of the page? Thinking things through
showed that this wouldn't work (URIs/IRIs have to work on their own,
not only in a page context). That's why we wrote the IRI spec, and
that's why browsers started to change. The argument of having to
work independently from a page applies to query parts, too.

Anybody who wants a query part in a URI to be in a native encoding
(because that's what their server-side script expects)
can use %-encoding, just as anybody who wants a path part to be in
a native encoding (because that's what their server responds to)
already uses %-encoding in their path part.

There are things in a spec that are just so because a WG decided
it that way, and there are things in a spec that are just so because
that's what implementations do, but there are things in a spec that
are the way they are because it's the only way that scales, i.e. the
only way that works in various contexts. Conversion of IRIs to URIs
uniformly (i.e. even for query parts) using UTF-8 is such a case.
Otherwise, please tell me what the URI is that a client should send
to the server for an IRI such as http://example.org?あいう
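
To show how much the answer depends on the encoding once the page context
is gone, here is a short Python sketch. The helper name to_uri and the
choice of Shift_JIS and EUC-JP as example page encodings are my own
assumptions for illustration.

```python
from urllib.parse import quote


def to_uri(prefix: str, query: str, encoding: str) -> str:
    # Percent-encode the query after converting it to bytes in `encoding`.
    return prefix + "?" + quote(query, safe="", encoding=encoding)


for encoding in ("utf-8", "shift_jis", "euc_jp"):
    print(encoding, to_uri("http://example.org", "あいう", encoding))
# utf-8 http://example.org?%E3%81%82%E3%81%84%E3%81%86
# shift_jis http://example.org?%82%A0%82%A2%82%A4
# euc_jp http://example.org?%A4%A2%A4%A4%A4%A6
```

Three different URIs for the same IRI; without a page around it, only
the UTF-8 one can be produced reliably.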


>> >Is there any chance that the URI and IRI specifications might get 
>> >updated to handle these issues?
>> 
>> Yes for the IRI spec. Most probably no for the URI spec.

Sorry, I forgot to mention that the IRI spec just expired late
last week, and I was at a conference giving two talks over the
weekend, and so I plan to publish a new version over the upcoming
weekend.

Regards,    Martin.

>Cool. Is there anything I can provide you with to help move some of the 
>stuff currently at:
>
>   http://www.whatwg.org/specs/web-apps/current-work/#urls
>
>...into the IRI spec?




#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp      mailto:duerst@it.aoyama.ac.jp    

Received on Wednesday, 25 June 2008 10:45:49 UTC