Re: Error handling in URIs

On Tue, 24 Jun 2008, Julian Reschke wrote:
> 
> I attend the html-wg's channel when I have time.

Great!


> So I'll assume that there is no problem parsing a URI reference without 
> knowing the base URI's scheme; otherwise you would tell us, right?

As far as I know there are only two issues relating to this (how to handle 
semicolons in hostnames of ftp: URIs vs http: URIs, which appears to be 
something limited to Mozilla's URI parser; and how to handle relative URIs 
when the base URI is a URI that doesn't use a naming authority, e.g. a 
data:, about:, or javascript: URI) but as far as I can tell neither of 
these issues is a "problem" in the sense that you mean. I haven't 
examined these two issues in enough depth to really know how they affect 
the HTML5 spec yet.


On Tue, 24 Jun 2008, Julian Reschke wrote:
> > 
> > Sure. But in this document:
> > 
> >    <!DOCTYPE HTML>
> >    <title>Test</title>
> >    <meta charset="ISO-8859-13">
> >    <a href="results.cgi/&#x017d;?&#x017d;">Link</a>
> > 
> > ...what is the link? It's not a URI, as it contains non-ASCII characters. 
> 
> Correct.
> 
> > It could be an IRI, but compatibility with Web content requires that it 
> 
> It *is* an IRI.
> 
> > not be treated per the IRI spec. Safari, for instance, will fetch the 
> 
> ...not be converted to a URI per the IRI spec...

Ok, well, I'm not really worried about the semantics here -- my point is 
that I can't simply defer to the URI and IRI specs wholesale because 
the results that would obtain don't match what the Web needs.


> > following URI (assuming the base URL is http://example.com/):
> > 
> >    http://example.com/results.cgi/%C5%BD?%DE
> > 
> > It yields a valid URI, but somewhere we have to define the processing that
> > led to two characters in the same URL being encoded using two different
> > character encodings. Right now, URI and IRI don't define this, so I'm
> > defining it in the HTML5 spec. This is unfortunately requiring very intimate
> > interaction with the parsing rules of URIs, which is far less orthogonality
> > than I would like.
> 
> You could change the algorithm for getting to the IRI in the first place,
> such as making it equivalent to:
> 
>  <a href="results.cgi/&#x017d;?&#xde;">
> 
> ...in which case the standard IRI->URI conversion would yield the expected
> result.

I'm not really sure what that would look like, compared to what I have 
now. Could you elaborate?
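
For reference, here is a rough sketch of the behaviour I'm describing for 
the Safari URI quoted above: the path is %-escaped as UTF-8 while the 
query is %-escaped in the document's encoding. It is written in Python 
purely as an illustration; the helper name resolve_like_safari and its 
parameters are made up for this example and are not taken from any 
browser or from the spec text.

```python
from urllib.parse import quote, urljoin


def resolve_like_safari(base, path, query, doc_encoding):
    """Hypothetical helper: escape path and query with different encodings."""
    # Path component: non-ASCII characters are percent-escaped as UTF-8.
    encoded_path = quote(path, safe="/", encoding="utf-8")
    # Query component: non-ASCII characters are percent-escaped using the
    # document's character encoding (assumed here to be able to represent
    # every character in the query; otherwise quote() raises an error).
    encoded_query = quote(query, safe="=&", encoding=doc_encoding)
    # Resolve the escaped relative reference against the base URL.
    return urljoin(base, encoded_path + "?" + encoded_query)


url = resolve_like_safari(
    "http://example.com/",   # base URL assumed in the example above
    "results.cgi/\u017d",    # path containing U+017D
    "\u017d",                # query containing U+017D
    "iso-8859-13",           # encoding declared by the <meta charset>
)
print(url)
```

The same character ends up as %C5%BD in the path and %DE in the query, 
which is the mismatch that the generic IRI-to-URI conversion doesn't 
produce on its own.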


> > IE actually sends http://example.com/results.cgi/%C5%BD?* where "*" is the
> > ISO-8859-13-encoded 8-bit byte for that character. If you target an 
> 
> Now that suggests to me that there is no interop between IE and Safari, and
> thus whatever you specify *may* break something.

The situation is far from perfect, indeed. That's why we need specs that 
define error handling, to avoid this mess where Web content relies on 
unspecified behavior and forces interoperability through 
reverse-engineering. (In this particular case, the differences between IE 
and the other browsers don't matter much because sites tend to only use 
one encoding, so the encoding source doesn't matter, and tend to convert 
%-escaped bits into their equivalent 8 bit octets before processing them, 
so they see the 8-bit URIs and the %-escaped URIs as equivalent.)
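
To illustrate that last point, here is a small sketch (again in Python, 
with a made-up function name, query_text, and assuming a site that uses 
ISO-8859-13 throughout) of a server that turns %-escapes back into octets 
before decoding. Under that assumption the %-escaped query and the raw 
8-bit query come out as the same text.

```python
from urllib.parse import unquote_to_bytes


def query_text(raw_query: bytes, site_encoding: str) -> str:
    """Hypothetical server-side step: unescape to octets, then decode."""
    # %XX sequences become single octets; octets that were already sent
    # as raw 8-bit bytes pass through unchanged.
    octets = unquote_to_bytes(raw_query)
    # The site decodes everything with the one encoding it uses.
    return octets.decode(site_encoding)


# Query as sent by a browser that %-escapes the document-encoded byte.
escaped = query_text(b"%DE", "iso-8859-13")
# Query as sent by a browser that sends the raw 8-bit byte instead.
raw = query_text(b"\xde", "iso-8859-13")
print(escaped == raw)
```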


> > > Now, that being said, is there anything HTML5 could do so we can get 
> > > closer to a strict UTF-8 world in the future? Such as allowing 
> > > servers to serve document in an encoding != UTF-8, but still get 
> > > query parameters to be consistently encoded in UTF-8?
> > 
> > There might be, but I don't see any way to get there at the moment. 
> > Any suggestions would be very welcome.
> 
> A form attribute through which the site can state: "I want 
> UTF-8-encoding-then-percent-escaping, no matter what the document 
> encoding was"?

We have that already. It doesn't really help regular links.


> Or potentially, in a more distant future, some way of specifying URI 
> templates (*)?
> 
> (*) Yes, when they are ready...

Maybe.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Tuesday, 24 June 2008 20:10:39 UTC