Re: Error handling in URIs from Frank Ellermann on 2008-06-25 (uri@w3.org from June 2008)

From: Frank Ellermann <nobody@xyzzy.claranet.de>
Date: Wed, 25 Jun 2008 06:04:11 +0200
To: uri@w3.org
Message-ID: <g3sg0h$dp7$1@ger.gmane.org>

Ian Hickson wrote:

>> It is technically not possible to define something else without
>> running into logical problems like your Safari + IE examples.
 
> Logic sadly doesn't have much to do with the way the Web works. :-(

Sure, but just because everybody does odd things in practice does
not necessarily mean that this needs to be noted in a standard.

If they agree on an oddity, maybe, but if not, let them do what
they wish.  A standard is an abstraction, not a collection of
observed behaviour boiled down by statistics into a MUST at 80%.

I think that is one of the problems I have when looking into an
HTML 5 draft:  Some choices appear to be arbitrary, as they are
not logical.   

> Right, that's why I was hoping we could update the URI spec.
> However, you suggest above that how to handle these erroneous
> addresses is an issue for the HTML spec and not the URI spec,
> so I'm not sure what you are actually suggesting.

Sorry, that was indeed unclear:  For XHTML 1 doctypes I'd know
that href= wants an RFC 2396 URI, so I'd conclude that this is
old, and if they ever update it they will say STD 66 (RFC 3986).

For HTML 5 you will say that href= wants an RFC 3987 IRI, but
you could also say that spaces are no problem, a kind of LEIRI,
for href=.  You could also decide that URI is good enough, as
it works everywhere, and IRI-producers would know how to get
an equivalent URI in the href, while URI consumers might not 
know what a native IRI, let alone LEIRI, is.
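A minimal sketch of what "getting an equivalent URI" could look like
on the producer side, assuming Python and only its standard library.
The helper name iri_to_uri is hypothetical, the host conversion uses
Python's built-in IDNA 2003 codec, and userinfo and IP literals are
deliberately not handled:

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched: "%" (so existing escapes are not double-encoded)
# plus the RFC 3986 delimiters that may legitimately appear in a component.
_SAFE = "%/?:@!$&'()*+,;="


def iri_to_uri(iri: str) -> str:
    """Hypothetical helper: map an IRI (or a LEIRI with spaces) to a URI.

    Simplifications (assumptions of this sketch): any userinfo is dropped,
    IP literals are not supported, and escapes are not validated.
    """
    parts = urlsplit(iri)

    # Host: convert the internationalized name to its ASCII (Punycode) form.
    host = parts.hostname or ""
    netloc = host.encode("idna").decode("ascii") if host else ""
    if parts.port is not None:
        netloc += ":" + str(parts.port)

    # Other components: percent-encode the UTF-8 bytes of anything unsafe.
    path = quote(parts.path, safe=_SAFE)
    query = quote(parts.query, safe=_SAFE)
    fragment = quote(parts.fragment, safe=_SAFE)
    return urlunsplit((parts.scheme, netloc, path, query, fragment))
```

With this sketch, iri_to_uri("http://example.org/a b") yields
"http://example.org/a%20b", so a consumer that only knows URIs never
has to see the space.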

E.g., FF2 gets convoluted <ihost>s right, but fails or failed 
for the simple test of an <ipath> in an iso-8859-1 document.
That is an FF2 bug, not something you want in the HTML5 spec.

> "%%x" and "%xx" aren't valid escape sequences

ACK, I missed the %%, and I was too lazy to check ##.  Right,
"#" is not permitted inside a fragment, only "?" and "/" are okay.

> The question is what should a browser do with that document.

Garbage in, garbage out.  For security reasons ignoring broken
URIs might be best.  The example was about an http: URI, so let
RFC 2616 and RFC 3986 talk about scheme-specific stuff (RFC 3986
is generic, but it also applies to http specifically).  Or rather
it was a wannabe IRI because HTML5 says so, but RFC 3987 has a
normative reference to RFC 3986 for these details, and adds its
own security considerations.  You can of course copy what you
like to emphasize into the HTML5 spec.  How about this:

"If a URI does not match the generic syntax in [RFC3986] it
 is invalid, and broken URIs can cause havoc."
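To illustrate what "does not match the generic syntax" means for the
examples in this thread, here is a rough sketch in Python.  It splits
the string with the regular expression from RFC 3986 Appendix B and
then checks each component against a simplified character set.  The
name looks_like_valid_uri is hypothetical, and the authority check is
only an approximation (no IP-literal or port grammar), so treat it as
a sketch and not as a complete validator:

```python
import re

# RFC 3986 Appendix B: split a reference into its five components.
_SPLIT = re.compile(
    r"(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?",
    re.DOTALL,
)

_PCT = r"%[0-9A-Fa-f]{2}"           # pct-encoded: "%" plus two hex digits
_UNRESERVED = r"A-Za-z0-9\-._~"
_SUB_DELIMS = r"!$&'()*+,;="
_PCHAR = "(?:[" + _UNRESERVED + _SUB_DELIMS + ":@]|" + _PCT + ")"

_SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+\-.]*")
# Approximation: allowed characters only, no IP-literal or port structure.
_AUTHORITY = re.compile(
    "(?:[" + _UNRESERVED + _SUB_DELIMS + r":@\[\]]|" + _PCT + ")*"
)
_PATH = re.compile("(?:" + _PCHAR + "|/)*")
# query and fragment share one rule: *( pchar / "/" / "?" ), so no "#".
_QUERY_OR_FRAGMENT = re.compile("(?:" + _PCHAR + "|[/?])*")


def looks_like_valid_uri(uri: str) -> bool:
    """Hypothetical check: absolute URI with only permitted characters."""
    scheme, authority, path, query, fragment = _SPLIT.fullmatch(uri).groups()
    if scheme is None or not _SCHEME.fullmatch(scheme):
        return False
    if authority is not None and not _AUTHORITY.fullmatch(authority):
        return False
    if not _PATH.fullmatch(path):
        return False
    if query is not None and not _QUERY_OR_FRAGMENT.fullmatch(query):
        return False
    if fragment is not None and not _QUERY_OR_FRAGMENT.fullmatch(fragment):
        return False
    return True
```

Under these assumptions "http://example.org/%%x", ".../%xx" and a
fragment like "#a#b" are all rejected, while "?" and "/" inside a
fragment pass.  A consumer following the "ignore broken URIs" advice
could simply drop every href for which the check returns False.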

> The choices are to define this primarily in the *RI specs,
> or to define it primarily in the HTML5 spec. Right now I'm
> picking the latter

URIs just have their own generic and specific specifications.
Good enough, but if you know cases where you want to recommend
a specific error handling...
    
> Error handling isn't an implementation detail when 90% of the
> input to the implementations are invalid, as on the Web.

...if it causes harm, and you know how to avoid it, go for it.
But make sure that you don't end up *redefining* what is
and what is not a valid xyz (URI, IRI, UTF-8, XML, PNG, etc.).

 Frank
Received on Wednesday, 25 June 2008 04:17:37 UTC