Re: Error handling in URIs from Roy T. Fielding on 2008-06-25 (uri@w3.org from June 2008)

From: Roy T. Fielding <fielding@gbiv.com>
Date: Wed, 25 Jun 2008 20:19:52 +0200
To: Ian Hickson <ian@hixie.ch>
Cc: URI <uri@w3.org>
Message-Id: <4DD4B169-2ACE-44BB-AED8-2BFA612F86D1@gbiv.com>
On Jun 25, 2008, at 7:33 AM, Ian Hickson wrote:
> Standards, for the purposes of the HTML5 effort, are comprehensive
> documentation intended to make it possible to implement user agents,
> and are thus very much not abstractions.

That is obviously the definition of an implementation specification,
not a standard.

> This isn't intended to disparage other beliefs or opinions as to what
> standards should be. I have no problem with standards that, e.g.,
> leave error handling undefined -- they are just not really relevant
> to the HTML5 work.

At this rate, the feeling will be mutual.  Why don't you just contribute
that documentation to the Mozilla website and be done?

> You seem to be conflating the authoring requirements and the user
> agent requirements. The authoring requirements for HTML5 are just
> "it must be a valid URI or IRI". That however has little bearing on
> what the user agent conformance requirements are. The UA requirements
> have to handle all manner of things that _aren't_ valid URIs or IRIs,
> since in practice such invalid content is prevalent.

To answer your original questions, you don't need to know the scheme
of the base URI in order to parse a URI reference.  You do need to know
it to convert a relative reference to an absolute reference, but only to
the extent that you need to know the string in order to copy it.  There
may be a few implementations that do it differently than what has been
defined in STD 66.  I don't care.  STD 66 will never be changed to suit
those implementations because there are a hundred that do it right for
every one that is wrong (and those numbers improve every week as old
code disappears).
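
As a rough illustration of that point (a sketch, not normative text): the
regular expression below is the one given in RFC 3986 (STD 66), Appendix B,
and it splits a reference into components without consulting any base URI.
The helper name split_reference and the example.com base are assumptions
made up for this example.

```python
import re
from urllib.parse import urljoin

# Regular expression from RFC 3986 (STD 66), Appendix B.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_reference(ref):
    """Split a URI reference into its five components (hypothetical helper).

    No base URI, and therefore no base scheme, is needed for this step.
    """
    m = URI_REFERENCE.match(ref)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


# Parsing works on the relative reference alone.
parts = split_reference("../b/c?x=1#frag")

# Resolution needs the base, but the base scheme is simply copied across.
# Note: Python's urljoin only resolves against schemes it recognises,
# so this sketch uses an http base.
resolved = urljoin("http://example.com/a/d/e", "../b/c?x=1#frag")
```

Here parts["scheme"] is None and parts["path"] is "../b/c"; the scheme in
the resolved result comes from the base string unchanged.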

How an HTML form constructs a query string is entirely defined by HTML.
The only thing defined by URI in that case is what characters are
allowed in the identifier set, and that's because of what is required
when the URI is sent outside of the HTML-construction context.  HTML
is only one of many hundreds of data formats that use URI.  HTML
cannot change the definition of URIs.
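
A small sketch of that division of labour, assuming Python's
urllib.parse.urlencode as a stand-in for form submission (it produces
application/x-www-form-urlencoded output; the field name "q" and its value
are invented for the example). The encoding rules belong to the form layer;
the URI side only constrains which characters may appear in the result.

```python
from urllib.parse import urlencode

# Form-style encoding: space becomes "+", "&" inside a value becomes "%26".
# These rules come from the form-encoding convention, not from the URI spec.
query = urlencode({"q": "a b&c"})

# The URI layer only cares that the result stays within its character set.
target = "/search?" + query
```

The resulting query is "q=a+b%26c", which contains only characters that a
URI is allowed to carry.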

The contents of href="whatever" are not a URI -- they are characters
that are processed as per SGML CDATA (IIRC) to transform them into a
sequence of characters in the document character set, which are then
considered by the HTML processor as data for the href attribute
(whatever that means, it is defined by HTML, not by URI).  If HTML
says that the valid data is limited to a URI in the document
character set (which is presumably mapped to ASCII when sent outside
the DOM), then the data either conforms to STD 66 or it is invalid.

What the browser does when it sees invalid data is entirely defined
by the browser and (sometimes) its configuration.  It has no relevance
whatsoever to the URI specification because it is not and never was
a URI.  The URI spec defines identifiers, not href attributes.  The
only result that matters is that the invalid data is not used by
sending it out of the DOM, such as by sending it as an invalid HTTP
request.  There is no chance that HTML5 will ever exist as a finished
document if it requires the sending of invalid HTTP requests as part
of its HTML implementation specification.
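
One way to picture "not sending it out of the DOM" is a guard in front of
whatever builds the request. The sketch below is an assumption-laden
illustration, not any browser's actual behaviour: the function names are
hypothetical, and the check covers only the character repertoire of STD 66
(unreserved, reserved, and percent-encoded octets), which is necessary but
not sufficient for full conformance to the grammar.

```python
import re

# Characters a URI may contain per RFC 3986: unreserved, reserved (gen-delims
# and sub-delims), or a percent sign followed by two hex digits.
_URI_CHARS = re.compile(
    r"(?:[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=]|%[0-9A-Fa-f]{2})*"
)


def uses_only_uri_characters(data):
    """Return True if data stays within the URI character set (partial check)."""
    return _URI_CHARS.fullmatch(data) is not None


def request_line(method, target):
    """Build an HTTP/1.1 request line, refusing targets that fail the check."""
    if not uses_only_uri_characters(target):
        raise ValueError("refusing to send invalid request target: %r" % target)
    return "%s %s HTTP/1.1" % (method, target)
```

With this guard, request_line("GET", "/a b") raises ValueError instead of
putting a raw space on the wire, while a properly encoded target passes
through untouched.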

No, it doesn't matter how many different implementations handle
invalid data in different ways.  You can repeat those imaginary
goals of HTML5 until the end of days and it still won't matter.
The right way to handle invalid data is to refuse to use it, where
"use" is entirely dependent on the context where it occurs.  I don't
care what MSIE does with invalid URI references.  I do care what
Firefox, Safari, and WebKit do with invalid URI references, but
only because I prefer to have them highlighted/rejected rather
than used.  The implementations I create refuse or reject invalid
data because to do anything else is going to be a security hole to
someone, somewhere, and it is simply irresponsible to repeat whatever
mistakes were made when hacking Mosaic in 1993.

....Roy
Received on Wednesday, 25 June 2008 18:20:27 UTC