Re: Error handling in URIs from Frank Ellermann on 2008-06-25 (uri@w3.org from June 2008)

From: Frank Ellermann <nobody@xyzzy.claranet.de>
Date: Thu, 26 Jun 2008 00:14:25 +0200
To: uri@w3.org
Message-ID: <g3ufsp$qe5$1@ger.gmane.org>

Ian Hickson wrote:

> browsers have already more or less converged on a behaviour.

But that behaviour is wrong, because it cannot work reliably,
outside of "if it is not UTF-8 then it must be iso-8859-1,
redefined to be windows-1252 in HTML5" scenarios.
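
A minimal sketch of the "UTF-8, else windows-1252" guess described above,
to show where it becomes ambiguous. The function name guess_decode is
hypothetical and is not taken from any browser or from the HTML5 draft.

```python
from urllib.parse import unquote_to_bytes


def guess_decode(escaped: str) -> tuple[str, str]:
    """Undo %-escapes, then guess the charset of the resulting bytes.

    Hypothetical helper: try strict UTF-8 first and fall back to
    windows-1252 when the bytes are not valid UTF-8.
    """
    raw = unquote_to_bytes(escaped)
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # A few windows-1252 byte values are undefined; replace them
        # instead of raising a second error.
        return raw.decode("windows-1252", errors="replace"), "windows-1252"


# %E9 is not valid UTF-8, so the fallback applies.
latin_guess = guess_decode("%E9")

# %C3%A9 is valid UTF-8, yet the same bytes are also valid windows-1252
# text with a different meaning, so the guess can be wrong.
utf8_guess = guess_decode("%C3%A9")
other_reading = unquote_to_bytes("%C3%A9").decode("windows-1252")
```

The guess only distinguishes the two encodings it knows about; bytes
from any other legacy charset are silently misread as one of them.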

> Safari and Mozilla encode both as UTF-8 and %-escape both.

Sounds like they got this right, didn't they?
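
As an illustration of "encode as UTF-8 and %-escape", here is a short
sketch using Python's standard urllib.parse.quote. The helper name
escape_utf8 and the sample character are assumptions for the example;
this is not a description of how Safari or Mozilla implement it.

```python
from urllib.parse import quote


def escape_utf8(component: str) -> str:
    """Encode a URI component as UTF-8, then %-escape the bytes."""
    # safe="" escapes everything except unreserved characters.
    return quote(component, safe="", encoding="utf-8")


def escape_latin1(component: str) -> str:
    """Same, but with iso-8859-1 as the byte encoding, for comparison."""
    return quote(component, safe="", encoding="iso-8859-1")


utf8_form = escape_utf8("\u00e9")      # e with acute accent
latin1_form = escape_latin1("\u00e9")
```

The two results differ for the same character, which is why the
receiving side has to know (or guess) which encoding was used.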

> It's about how to handle legacy, unmaintained, historical
> documents. If we break them, we (humanity) lose part of our
> legacy. That would be unfortunate.

It would also be a red herring for IRIs, which were specified
in RFC 3987 only 3.5 years ago and are not permitted in HTML 4
or XHTML 1 pages.

If we are talking about method="get" forms and corresponding
IRIs with an <iquery>, 'human legacy' is an obscure argument -
but I don't see what's wrong with what Safari and Mozilla do.

> Ok. HTML5 is an implementation specification.

Better split off the parts where it's a document type definition
for authors; the audience is far too different.  If you tell
authors what they can get away with, they won't see the point
of, say, "<s> is deprecated" vs. "interpret <s> as <del>".

> the HTML5 spec goes out of its way to avoid sending invalid
> URIs to servers, though that may have to change depending
> on what existing content depends on.

It would also depend on what existing and future servers for
relevant URI schemes expect, including servers implementing
the various protocols - (X)HTML(5) is not the only context
for URIs, and HTTP(S) is not the only URI scheme.

> http://whatwg.org/html5
> That should be more usable.

Yes, thanks, much better.
 
> I believe the confusion here is that the term "URL" as used
> in the HTML5 spec is intended to be a term independent of
> the term "URL" as used in the URI spec.

+1  

 [IRL proposal]
> I think people would be more confused by the use of the term
> "IRL" than "URL" (with the exception of people intimately
> familiar with the URI spec). Maybe the term "address" would
> work?

If you are sure that you don't need "address" for something
else, it is fine.  IE-fans would know what you are talking
about.  And I finally got used to the idea that "address"
means what I know as "location".

In the direction of:   "An 'address' is the URI (STD 66) derived
from a valid IRI (RFC 3987) or invalid constructs as specified
below" (etc.)  

>> Broken URLs have caused real damage last year:
>> http://www.microsoft.com/technet/security/advisory/943521.mspx
>> http://www.heise-security.co.uk/news/97878

> Right, that's why defining error handling is critical, and why
> a spec that doesn't define error handling is, frankly, 
> irresponsible. By defining error handling, we help guarantee
> that any input results in a known, predictable, and most
> importantly _safe_ behaviour.

IMHO you could leave this at "MUST NOT be interpreted as URI" or
similar, but that might be a matter of taste.  Are you going to
specify the exact error handling for, say, surrogates and overlong
encodings in UTF-8?  I'd have ideas about this, but I don't see
that it belongs in an HTML5 specification.
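
To make the two UTF-8 cases concrete, here is a small sketch that relies
on the strict UTF-8 decoder in Python 3, which rejects both. The helper
name is_strict_utf8 and the sample byte strings are assumptions chosen
for illustration; what a browser should do after rejecting such input is
the open question and is not answered here.

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True only if data is well-formed UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# Overlong two-byte form of "/" (U+002F); the only valid form is b"/".
OVERLONG_SLASH = b"\xc0\xaf"

# UTF-8-style encoding of the lone surrogate U+D800.
ENCODED_SURROGATE = b"\xed\xa0\x80"

overlong_ok = is_strict_utf8(OVERLONG_SLASH)
surrogate_ok = is_strict_utf8(ENCODED_SURROGATE)
plain_ok = is_strict_utf8("\u00e9".encode("utf-8"))
```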

 Frank
Received on Wednesday, 25 June 2008 22:13:32 UTC