Re: Error handling in URIs from Ian Hickson on 2008-06-25 (uri@w3.org from June 2008)

From: Ian Hickson <ian@hixie.ch>
Date: Wed, 25 Jun 2008 02:11:25 +0000 (UTC)
To: Frank Ellermann <hmdmhdfmhdjmzdtjmzdtzktdkztdjz@gmail.com>
Cc: uri@w3.org
Message-ID: <Pine.LNX.4.62.0806250201580.13974@hixie.dreamhostps.com>

On Wed, 25 Jun 2008, Frank Ellermann wrote:
> Ian Hickson wrote:
> 
> > <!DOCTYPE HTML>
> > <title>Test</title>
> > <meta charset="ISO-8859-13">
> > <a href="results.cgi/&#x017d;?&#x017d;">Link</a>
>  
> > ...what is the link?
> 
> It is whatever the unspecified "HTML" document type definition says.

Ok.


> So this is an IRI, no URI, and invalid in document types permitting only 
> URIs.

Well there's no question that it's invalid, the question is what should 
browsers do with it.


> That the HTML 4 spec. is at best fuzzy about this is one of the reasons 
> why you want HTML5, isn't it ?

Indeed.


> > Safari, for instance, will fetch the following URI
> > (assuming the base URL is http://example.com/):
> 
> >    http://example.com/results.cgi/%C5%BD?%DE
> 
> Trying to be smart... :-( If it can deal with UTF-8, and obviously that 
> is the case, it should better try %C5%BD also as query.  Otherwise the 
> server has no good chance to figure what is going on.

Actually it appears that servers rely on this behaviour, for better or 
for worse.
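
To make the behaviour concrete, here is a rough Python sketch of that 
split: the path is percent-encoded from its UTF-8 bytes, the query from 
its bytes in the document's encoding. The function name 
encode_like_safari and the choice of "safe" characters are my own 
assumptions for illustration, not anything taken from a browser or a 
spec.

```python
from urllib.parse import quote

def encode_like_safari(path, query, doc_encoding):
    """Hypothetical helper: mimic the mixed-encoding result described above."""
    # Path component: percent-encode the UTF-8 bytes of each non-ASCII character.
    enc_path = quote(path, safe="/", encoding="utf-8")
    # Query component: percent-encode the bytes in the document's own encoding.
    enc_query = quote(query, safe="=&", encoding=doc_encoding)
    return f"{enc_path}?{enc_query}"

# U+017D is the character written as &#x017d; in the test document.
url = "http://example.com/" + encode_like_safari(
    "results.cgi/\u017d", "\u017d", "iso8859_13"
)
print(url)  # http://example.com/results.cgi/%C5%BD?%DE
```

U+017D is the two bytes C5 BD in UTF-8 but the single byte DE in 
ISO-8859-13, which is how one character ends up with two different 
escapes in the same URL.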


> > we have to define the processing that led to two characters in the 
> > same URL being encoded using two different character encodings.
> 
> Just don't, take RFC 3987 "as is".

That's unfortunately not an option.


> It is technically not possible to define something else without running 
> into logical problems like your Safari + IE examples.

Logic sadly doesn't have much to do with the way the Web works. :-(


> > Any suggestions would be very welcome.
> 
> Stay out of trouble.  There is a list about IRIs, and if folks want to 
> update RFC 3987 in wild and wonderful ways they can try their luck on 
> this list.  Follow all standards down to their damned last comma, or 
> update them.  Any attempt to "redefine" standards elsewhere, e.g. 
> directly in HTML5, is doomed.

Right, that's why I was hoping we could update the URI spec. However, you 
suggest above that how to handle these erroneous addresses is an issue 
for the HTML spec and not the URI spec, so I'm not sure what you are 
actually suggesting.


> > Similarly, what should the script in the following example display in 
> > the alert dialog box, assuming a base URL of http://example.com/ ?
>  
> >    <!DOCTYPE HTML>
> >    <title>Test</title>
> >    <p><a href="{{%%xx##">Test</a>
> >    <script>alert(document.links[0].href)</script>
>  
> > Where is this defined?
> 
> The odd DOCTYPE is defined by you, so you're supposed to know what it 
> means.

Ah. I was hoping that your answer would be "The URI spec defines this as 
being [...something]". I had indeed assumed that it was up to the HTML5 
spec to define it, but Julian (and yourself, above) suggested not defining 
it there and deferring to the *RI specs, thus my question.


> I'd say for similar document types the href= attribute value is no form 
> of STD 66 URI I can identify at first glance.  Especially curly braces 
> are nowhere permitted in any STD 66 URI, they are no <pchar>.

There's no question that it is invalid, indeed (in addition to the curly 
brace issue, "%%x" and "%xx" aren't valid escape sequences, and "##" is 
not a valid fragment identifier). The question is what should a browser do 
with that document.
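
For reference, the three problems listed above can be detected 
mechanically. The following is only a sketch of a checker in Python, 
assuming the RFC 3986 character repertoire; the name uri_problems and 
the wording of its messages are made up for this example. It reports 
what is wrong and deliberately says nothing about how a browser should 
recover, which is the open question.

```python
import re

# Characters RFC 3986 permits somewhere in a URI reference
# (unreserved + gen-delims + sub-delims), plus "%" for escapes.
ALLOWED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
    ":/?#[]@!$&'()*+,;=%"
)

def uri_problems(ref):
    """Hypothetical checker: list the reasons `ref` is not a valid URI reference."""
    problems = []
    # 1. Characters outside the permitted repertoire, e.g. curly braces.
    bad_chars = sorted({c for c in ref if c not in ALLOWED})
    if bad_chars:
        problems.append("disallowed characters: " + " ".join(bad_chars))
    # 2. A "%" that is not followed by exactly two hex digits.
    bad_escapes = re.findall(r"%(?![0-9A-Fa-f]{2})", ref)
    if bad_escapes:
        problems.append(f"{len(bad_escapes)} malformed percent-escape(s)")
    # 3. A second "#": the fragment itself may not contain "#".
    if ref.count("#") > 1:
        problems.append('"#" inside the fragment')
    return problems

for problem in uri_problems("{{%%xx##"):
    print(problem)
```

Run on the href from the test document, this reports all three 
problems; a well-formed reference such as 
"http://example.com/a%20b#frag" yields an empty list.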

The choices are to define this primarily in the *RI specs, or to define it 
primarily in the HTML5 spec. Right now I'm picking the latter, but I'd 
like to be able to just point to the *RI specs, for optimal orthogonality.


> So that is no URI, and better don't ask the poor IANA example.com server 
> for a second opinion.

Indeed, there's no reason to contact any servers to process that document, 
it's purely a question of how to parse and resolve the address.


> It is also no IRI, and percent-encoding the curly braces doesn't help - 
> unless you think it is a cute way to deal with this error condition.

Well, the actual processing that must happen is pretty much fixed by 
legacy documents, that's not really the question either. The question is 
which spec should define it.


> But a specification going too far into implementation details is a 
> matter of taste: I'd love it when I agree, and hate it otherwise.

Error handling isn't an implementation detail when 90% of the input to the 
implementations is invalid, as on the Web. Well-defined error handling in 
such environments is a prerequisite to interoperability.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Wednesday, 25 June 2008 02:12:00 UTC