- From: Martin Duerst <duerst@it.aoyama.ac.jp>
- Date: Wed, 25 Jun 2008 19:44:12 +0900
- To: Ian Hickson <ian@hixie.ch>
- Cc: uri@w3.org, public-iri@w3.org
Hello Ian,

At 11:01 08/06/25, Ian Hickson wrote:
>
>On Tue, 24 Jun 2008, Martin Duerst wrote:
>> >
>> >The first is error handling behaviour for URIs. Browsers are reasonably
>> >consistent in their handling of invalid URI references such as:
>> >
>> >   http://example.com/hello world/
>> >
>> >...or:
>> >
>> >   {{%%xx##
>> >
>> >...but the URI specification just says that these URI references are
>> >invalid and doesn't really say what to do with them.
>>
>> Yes. For the first example above, most browsers probably just convert the
>> space to a %20, but there might be exceptions. I think this is indeed
>> the job of the HTML spec to specify what is expected, either an error or
>> some other behavior. It's unreasonable to expect that the URI or IRI
>> specs could do that on behalf of the HTML spec or some other specs,
>> because there may be technologies that want to be rather permissive, and
>> others that want to be very strict.
>
>Fair enough.

Well, reading some mails in the thread, I got reminded about LEIRIs
(they are painful legacy, so it's easy to forget about them :-(.

During the LEIRI discussion, one of the things I realized was relevant
was what various specs did to URIs/IRIs on top of RFC 3986/7. So I
started a page to put information on:
http://www.w3.org/International/iri-edit/spec-use-survey.html
This is hopefully incomplete, any additions appreciated.

The motivations behind LEIRI were manifold:
- A series of specs (XML, XLink, XML Schema,...) that all relied on
  the same definition of 'IRI' (so it seemed to make sense to deal
  with this in a central place).
- The fact that these specs had this kind of definition came from
  an earlier (draft) version of the IRI spec (so just saying:
  "it's all your fault, go fix it" didn't exactly work).
- The claim that any iota of effective change to the original specs
  would be very, very bad, because this was XML, and XML didn't change.
- An existing internet draft that misleadingly used the term
  "Human Readable Resource Identifiers" (as if something with a few
  C0 characters would be very humanly readable)

The LEIRI section is written in a way that allows other specs to
participate partially (e.g. by saying "we allow spaces and tabs and
newlines in our IRI attributes, but not general C0 crap").

Do you have any idea of what browsers do if they get e.g. a 0x06
byte (ACK)?

>> For the second case, that's total garbage, and shouldn't resolve. The
>> URI/IRI specs would already tell you that (illegal would mean that it
>> doesn't resolve, or wouldn't it), and I just hope HTML5 works things out
>> so that you don't suddenly come up with something that would work for
>> that case.
>
>Most HTML UAs do in fact pretty much interoperate on the second one (they
>end up requesting "/%7B%7B%%xx" or "/{{%%xx" from the server over HTTP),
>so it seems likely that at least some content depends on this behaviour.
>Right now HTML5 says to fetch "/%7B%7B%25%25xx", though that might have to
>change to fetch "/%7B%7B%%xx" depending on what content depends on.

Okay, yes, if you take that relative, then you have somewhere to start.
If the latter is current browser behavior, then I'm sure there are
a few pages out there that actually depend on that. But how many?

Looking back over the more than 10 years I have been involved in
(RFC 2070) or have been watching the HTML spec work, I very much
agree that only doing idealistic theory doesn't work. On the other
hand, I think it's also dangerous to go ahead and cast in stone any
and all browser quirks (even if they are consistent across the major
contenders). Where do you want to draw the line? I know all browsers
in the past have made changes that made small numbers of pages stop
working.

Related, we are now discussing what a browser has to accept, and
how to process it, but do you separately say what authors should
do and what not?
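
To make the difference between the candidate requests for {{%%xx##
concrete, here is a minimal Python sketch. It is only an illustration,
not what any browser or the HTML5 draft actually implements: the
function name escape_path and its keep_percent switch are made up for
this example, and dropping everything after the first "#" as a
fragment is a simplifying assumption.

```python
from urllib.parse import quote


def escape_path(reference: str, keep_percent: bool) -> str:
    """Percent-encode the part of `reference` before the first '#'.

    keep_percent=False escapes a stray '%' as %25 (the HTML5 draft
    behaviour quoted above); keep_percent=True leaves '%' untouched
    (the behaviour reported for most UAs).
    Hypothetical helper for illustration only.
    """
    path, _, _fragment = reference.partition("#")
    safe = "/%" if keep_percent else "/"
    return quote(path, safe=safe)


print(escape_path("hello world/", keep_percent=False))  # hello%20world/
print(escape_path("{{%%xx##", keep_percent=False))      # %7B%7B%25%25xx
print(escape_path("{{%%xx##", keep_percent=True))       # %7B%7B%%xx
```

The only difference between the two variants is whether "%" counts as
a character to leave alone, which is exactly the point under
discussion.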

The email folks (RFC 2821/2 (bis)) have had reasonable success working
on cleaning up some of their mess (in many ways comparable to the HTML
mess) by very carefully creating "MUST accept, but MUST NOT generate"
obsolete parts of their syntax.

>> >The second is with IRIs and character encodings other than UTF-8. While
>> >browsers reliably encode non-ASCII characters in the path using UTF-8,
>> >non-ASCII characters in the query component are encoded using the
>> >document's character encoding, and not UTF-8, which is incompatible
>> >with how the IRI spec defines things.
>>
>> This is indeed a point that's missing from the current IRI spec, and
>> that I plan to add (the draft just expired a few days ago, and I hope to
>> put out a new one maybe over next weekend).
>>
>> At the moment, it seems clear to me that IRIs put into the address field
>> should be handled as all-UTF-8 (because there is no encoding context
>> around), and that data coming from form fields should be handled as
>> being in the page encoding (in that case, it can be thought to be
>> converted to %HH before it even gets put together, so it never actually
>> might be an IRI), but the case of the query part in IRIs inside a
>> document is indeed a tough one: On the one part, many things work best
>> if you use page encoding for that query part; on the other hand, once
>> you take that IRI out of the page, you lose the encoding context.
>>
>> I think the most consistent thing to do would be to say: Either use %HH
>> encoding or accept the fact that the browser will send it back to you as
>> UTF-8. This way, IRIs with query parts would be treated like everything
>> else.
>
>Unfortunately there is definitely content that relies on this as UAs
>implement now. It seems that all we can say is to encode the query
>component using the current encoding, and the path using UTF-8.

Can you tell me what UAs do if you just paste such an IRI into the
address bar? Where do they get the encoding from?
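
To show why the encoding context matters, the following Python sketch
percent-encodes the same query text under three encodings. It is an
illustration only: the helper encode_query, the sample text, and the
choice of Shift_JIS and EUC-JP as example page encodings are
assumptions made for this sketch, not something taken from any spec.

```python
from urllib.parse import quote


def encode_query(text: str, page_encoding: str) -> str:
    """Percent-encode `text` as its bytes in `page_encoding`.

    Hypothetical helper: it mimics a UA that serializes the query
    component in the document's character encoding.
    """
    return quote(text, safe="", encoding=page_encoding)


sample = "あいう"  # sample non-ASCII query text (assumption)

for enc in ("utf-8", "shift_jis", "euc_jp"):
    print(enc, encode_query(sample, enc))
# utf-8     %E3%81%82%E3%81%84%E3%81%86
# shift_jis %82%A0%82%A2%82%A4
# euc_jp    %A4%A2%A4%A4%A4%A6
```

Three different byte sequences reach the server for the same visible
text, and nothing in the resulting URI says which encoding produced
them.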

Also, do you realize that a few years ago, many browsers interpreted
the path part in the encoding of the page? Thinking things through
showed that this wouldn't work (URIs/IRIs have to work on their own,
not only in a page context). That's why we wrote the IRI spec, and
that's why browsers started to change.

The argument of having to work independently from a page applies
to query parts, too. Anybody who wants a query part in a URI to
be in a native encoding (because that's what their server-side
script expects) can use %-encoding, the same way anybody who wants a
path part to be in a native encoding (because that's what their
server responds to) already uses %-encoding in their path part.

There are things in a spec that are just so because a WG decided
it that way, and there are things in a spec that are just so
because that's what implementations do, but there are things
in a spec that are the way they are because it's the only way
that scales, i.e. the only way that works in various contexts.
Conversion of IRIs to URIs uniformly (i.e. even for query parts)
using UTF-8 is such a case.

Otherwise, please tell me what the URI is that a client should
send to the server for a URI such as
http://example.org?あいう

>> >Is there any chance that the URI and IRI specifications might get
>> >updated to handle these issues?
>>
>> Yes for the IRI spec. Most probably no for the URI spec.

Sorry, I forgot to mention that the IRI spec just expired late
last week, and I was at a conference giving two talks over
the weekend, and so I plan to publish a new version over
the upcoming weekend.

Regards,    Martin.

>Cool. Is there anything I can provide you with to help move some of the
>stuff currently at:
>
>   http://www.whatwg.org/specs/web-apps/current-work/#urls
>
>...into the IRI spec?


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp

Received on Wednesday, 25 June 2008 10:45:49 UTC