Re: Error handling in URIs from Ian Hickson on 2008-06-24 (uri@w3.org from June 2008)

From: Ian Hickson <ian@hixie.ch>
Date: Tue, 24 Jun 2008 21:08:00 +0000 (UTC)
To: Julian Reschke <julian.reschke@gmx.de>
Cc: Anne van Kesteren <annevk@opera.com>, uri@w3.org
Message-ID: <Pine.LNX.4.62.0806242101240.13974@hixie.dreamhostps.com>

On Tue, 24 Jun 2008, Julian Reschke wrote:
> > >
> > > You could change the algorithm for getting to the IRI in the first 
> > > place, for example by making it equivalent to:
> > > 
> > >  <a href="results.cgi/&#x017d;?&#xde;">
> > > 
> > > ...in which case the standard IRI->URI conversion would yield the 
> > > expected result.
> > 
> > I'm not really sure what that would look like, compared to what I have 
> > now. Could you elaborate?
> 
> 1. Consider the input an IRI
> 
> 2. Convert non-ASCII characters in the query part to URI characters by 
> encoding them in the document character set, then percent-escaping
> 
> 3. Go on with regular IRI->URI conversion.
> 
> Of course that's almost the same as re-doing all the work done in the 
> IRI spec, but at least you wouldn't need to worry about IDN stuff.

Yeah, we could do that I guess. I think it ends up being simpler for 
implementors if we just add the paragraph about IDN and make them 
implement URIs instead of URIs and IRIs. One fewer layer of abstraction.
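
For illustration only, the three steps quoted above could be sketched 
roughly as follows. This is a minimal sketch, not text from any spec: the 
function names href_to_uri and escape_non_ascii are made up for this 
example, the host is assumed to be ASCII already (so no IDN handling), and 
a query character that the document encoding cannot represent simply 
raises an error here.

```python
def escape_non_ascii(text, encoding):
    """Percent-escape only the non-ASCII characters of text.

    ASCII characters, including any existing %XX escapes, are left alone.
    Hypothetical helper for this sketch.
    """
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        else:
            # Raises UnicodeEncodeError if the encoding lacks this character.
            out.extend("%{:02X}".format(b) for b in ch.encode(encoding))
    return "".join(out)


def href_to_uri(href, document_encoding):
    """Steps 1-3: treat href as an IRI, pre-escape the query, then IRI->URI."""
    # Step 1: consider the input an IRI and split off fragment and query.
    rest, hash_mark, fragment = href.partition("#")
    before_query, question_mark, query = rest.partition("?")

    # Step 2: query characters are encoded in the document encoding,
    # then percent-escaped.
    query = escape_non_ascii(query, document_encoding)

    # Step 3: regular IRI->URI conversion (UTF-8, then percent-escaping)
    # for everything else. Assumption: the host part is plain ASCII.
    before_query = escape_non_ascii(before_query, "utf-8")
    fragment = escape_non_ascii(fragment, "utf-8")

    return before_query + question_mark + query + hash_mark + fragment


# The href from the example above, in a windows-1252 document:
print(href_to_uri("results.cgi/\u017d?\u00de", "cp1252"))
# prints results.cgi/%C5%BD?%DE
```

With a UTF-8 document the same href would come out as 
results.cgi/%C5%BD?%C3%9E, so the path bytes stay the same and only the 
query bytes depend on the document encoding.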


> > The situation is far from perfect, indeed. That's why we need specs 
> > that define error handling, to avoid this mess where Web content 
> > relies on unspecified issues and forces interoperability through 
> > reverse-engineering. (In this particular case, the differences between 
> > IE and the other browsers don't matter much because sites tend to only 
> > use one encoding, so the encoding source doesn't matter, and tend to 
> > convert %-escaped bits into their equivalent 8 bit octets before 
> > processing them, so they see the 8-bit URIs and the %-escaped URIs as 
> > equivalent.)
> 
> As long as no intermediary re-encodes the resource.

Yeah, there are a lot of edge cases where the behaviours are noticeably 
different (just look at the dozen plus test cases I mentioned earlier).
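
As a rough illustration of the quoted point about servers: a site that 
turns %-escapes back into octets before processing sees the raw 8-bit 
form and the %-escaped form of a query as the same bytes, but it still 
sees different bytes when the encoding differs. The sketch below uses 
Python's urllib.parse.unquote_to_bytes; the query strings are made-up 
examples.

```python
from urllib.parse import unquote_to_bytes

# U+00DE sent as a raw windows-1252 octet versus its %-escaped form.
raw_8bit = b"q=\xde"
escaped = b"q=%DE"

# After unescaping, the server sees identical octets for both.
assert unquote_to_bytes(raw_8bit) == unquote_to_bytes(escaped)

# The same character escaped from UTF-8 gives different octets, which only
# matters if a site has to handle more than one encoding.
escaped_utf8 = b"q=%C3%9E"
assert unquote_to_bytes(escaped_utf8) != unquote_to_bytes(escaped)

print(unquote_to_bytes(escaped))       # b'q=\xde'
print(unquote_to_bytes(escaped_utf8))  # b'q=\xc3\x9e'
```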


> > > > > Now, that being said, is there anything HTML5 could do so we can 
> > > > > get closer to a strict UTF-8 world in the future? Such as 
> > > > > allowing servers to serve document in an encoding != UTF-8, but 
> > > > > still get query parameters to be consistently encoded in UTF-8?
> > > >
> > > > There might be, but I don't see any way to get there at the 
> > > > moment. Any suggestions would be very welcome.
> > >
> > > A form attribute through which the site can state: "I want 
> > > UTF-8-encoding-then-percent-escaping, no matter what the document 
> > > encoding was"?
> > 
> > We have that already. It doesn't really help regular links.
> 
> Regular links aren't a problem (if I understand "regular" correctly), 
> because the site owner generated them.

Regular links are the only thing I'm concerned about at the moment.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Tuesday, 24 June 2008 21:08:39 UTC