Re: Error handling in URIs

On Thu, 26 Jun 2008, Frank Ellermann wrote:
> > 
> > browsers have already more or less converged on a behaviour.
> 
> But that behaviour is wrong, because it cannot work reliably, outside of 
> "if it is not UTF-8 then it must be iso-8859-1, redefined to be 
> windows-1252 in HTML5" scenarios.

Whether it's right or wrong is neither here nor there, frankly.

It can work reliably insofar as all user agents can do the same thing, 
which is what we're aiming for in the HTML5 effort.


> > Safari and Mozilla encode both as UTF-8 and %-escape both.
> 
> Sounds like they got this right, didn't they ?

This was in the context of copied-and-pasted URLs, which is user 
interface, for which interoperability isn't a big deal (at least not 
compared to handling actual legacy content).


> > It's about how to handle legacy, unmaintained, historical documents. 
> > If we break them, we (humanity) lose part of our legacy. That would be 
> > unfortunate.
> 
> It would be also a red herring for IRIs specified in RFC 3987 only 3.5 
> years ago, not permitted in HTML 4 or XHTML 1 pages.

There are pages that aren't UTF-8 encoded that contain links with 
non-ASCII characters in query components. Whether those pages existed 
before or after the IRI spec did isn't really relevant. What's important 
is that those pages exist and browsers don't want to break them -- and 
that means that if I want my spec to not be ignored, I have to take them 
into account and support them.
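
To make the case concrete, here is a minimal Python sketch of the kind of
link at issue. It assumes, purely for illustration, a page encoded as
windows-1252 whose server expects the query bytes in that same encoding.
The query string and the helper name escape_query are made up for the
example.

```python
from urllib.parse import quote


def escape_query(query: str, encoding: str) -> str:
    """Percent-escape a query component (hypothetical helper).

    The characters are first turned into bytes using `encoding`; every
    byte outside the unreserved set and "=&" is then %-escaped.
    """
    return quote(query, safe="=&", encoding=encoding)


# The same link text yields different bytes on the wire depending on
# which encoding is used to escape it.
legacy = escape_query("q=café", "windows-1252")  # the page's own encoding
utf8 = escape_query("q=café", "utf-8")           # always-UTF-8 behaviour

print(legacy)  # q=caf%E9
print(utf8)    # q=caf%C3%A9
```

A server written to expect the first form would not recognise the second,
which is why such pages have to be taken into account.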


> If we are talking about method="get" forms and corresponding IRIs with 
> an <iquery> 'human legacy' is an obscure argument - but I don't see 
> what's wrong with what Safari and Mozilla do.

Forms are a whole different problem. It's links that are of concern here.


> > Ok. HTML5 is an implementation specification.
> 
> Better split the parts where it's a document type definition for 
> authors, the audience is far too different.  If you tell authors what 
> they can get away with they won't see the point of say "<s> is 
> deprecated" vs. "interpret <s> as <del>".

Yeah, that's on the cards for when the spec is more stable (we'll probably 
generate two or three documents automatically for different audiences).


>  [IRL proposal]
> > I think people would be more confused by the use of the term "IRL" 
> > than "URL" (with the exception of people intimately familiar with the 
> > URI spec). Maybe the term "address" would work?
> 
> If you are sure that you don't need "address" for something else it is 
> fine.  IE-fans would know what you are talking about.  And I finally got 
> used to the idea that "address" means what I know as "location".
> 
> In the direction of:  "An 'address' is the URI (STD 66) derived from a 
> valid IRI (RFC 3987) or invalid constructs as specified below" (etc.)

It was brought to my attention on IRC that "address" is probably as 
overloaded as "URL" so this might not be a step forwards for the spec, 
just a step sideways. I'll see what can be done though. It might be that 
the spec just uses the term "URL" and ignores the URI spec's definition of 
the term. Most people seem to understand the intent; as far as I know, 
you're the only person whom this has confused.


> >> Broken URLs have caused real damage last year:
> >> http://www.microsoft.com/technet/security/advisory/943521.mspx
> >> http://www.heise-security.co.uk/news/97878
> > 
> > Right, that's why defining error handling is critical, and why a spec 
> > that doesn't define error handling is, frankly, irresponsible. By 
> > defining error handling, we help guarantee that any input results in a 
> > known, predictable, and most importantly _safe_ behaviour.
> 
> IMHO you could leave this at "MUST NOT be interpreted as URI" or 
> similar, but that might be a matter of taste.

Well, we could say that, but then browser vendors would ignore us. I don't 
want browser vendors to ignore us.


> Are you going to specify the exact error handling for say surrogates and 
> overlong encodings in UTF-8 ?  I'd have ideas about this, but I don't 
> see that it belongs in an HTML5 specification.

These issues were brought to the attention of the Unicode consortium, who 
are looking into addressing these error handling issues in their specs.
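
For readers unfamiliar with the two cases mentioned, a minimal Python
sketch follows. It only shows what Python's built-in UTF-8 decoder happens
to do with an overlong encoding and an encoded surrogate; it is not a
statement of what any spec requires. The helper name is_strict_utf8 is
made up for the example.

```python
def is_strict_utf8(raw: bytes) -> bool:
    """Return True if `raw` decodes under Python's strict UTF-8 decoder."""
    try:
        raw.decode("utf-8")  # errors="strict" is the default
        return True
    except UnicodeDecodeError:
        return False


OVERLONG_SLASH = b"\xc0\xaf"  # overlong two-byte form of "/" (U+002F)
SURROGATE = b"\xed\xa0\x80"   # UTF-8-style encoding of U+D800

print(is_strict_utf8(b"/"))            # True
print(is_strict_utf8(OVERLONG_SLASH))  # False
print(is_strict_utf8(SURROGATE))       # False

# One possible recovery: substitute U+FFFD rather than reject the input.
recovered = OVERLONG_SLASH.decode("utf-8", errors="replace")
print("/" in recovered)                # False
```

The point is only that a decoder has to do something with such input, and
what exactly it does is the sort of detail that needs to be specified
somewhere.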

I agree entirely that this kind of error handling stuff shouldn't be in 
HTML5. The only times HTML5 defines error handling for things outside the 
"HTML" language itself is when the relevant specs don't define their own 
error handling, and the relevant groups refuse to do anything about it.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Wednesday, 25 June 2008 22:36:18 UTC