Re: Error handling in URIs from Ian Hickson on 2008-06-25 (uri@w3.org from June 2008)

From: Ian Hickson <ian@hixie.ch>
Date: Wed, 25 Jun 2008 20:40:06 +0000 (UTC)
To: Charles Lindsey <chl@clerew.man.ac.uk>, Henri Sivonen <hsivonen@iki.fi>, Martin Duerst <duerst@it.aoyama.ac.jp>, "Roy T. Fielding" <fielding@gbiv.com>, Frank Ellermann <hmdmhdfmhdjmzdtjmzdtzktdkztdjz@gmail.com>
Cc: URI <uri@w3.org>
Message-ID: <Pine.LNX.4.62.0806252005411.13974@hixie.dreamhostps.com>

On Wed, 25 Jun 2008, Charles Lindsey wrote:
> > 
> > Well there's no question that it's invalid, the question is what should
> > browsers do with it.
> 
> Essentially, it is up to the browser what it accepts.

That's one option, though it's not the way we've done things in HTML5 so 
far (for example we define how to parse any arbitrary byte stream).


> But in the meantime, a sensible strategy is for a browser whose pages 
> were published in iso-8859-99 (whatever that might be) to accept 
> IRIs/URIs (and especially queries) %-encoded into iso-8859-99; but also, 
> *in addition* to convert incoming UTF-8 (whether in IRIs or %-encoded in 
> URIs) to its own iso-8859-99.

Well, as noted before, the actual behaviour we need to spec isn't really 
up for debate; browsers have already more or less converged on a 
behaviour. The original question (now answered) was merely which spec 
would define this. (HTML5 now defines it.)


On Wed, 25 Jun 2008, Martin Duerst wrote:
>
> Do you have any idea of what browsers do if they get e.g. an 0x06 byte 
> (ACK)?

Not off-hand, though it's reasonably easy to test it.


> If the latter is current browser behavior, then I'm sure there are a few 
> pages out there that actually depend on that. But how many?

I haven't done a study to examine this yet. In practice though even minute 
numbers end up mattering, given the scale of the Web.


> I think it's also dangerous to go ahead and cast in stone any and all 
> browser quirks (even if they are consistent across the major 
> contenders). Where do you want to draw the line? I know all browsers in 
> the past have made changes that made small numbers of pages stop 
> working.

Generally with HTML5 the line is drawn at what is interoperable; when 
things aren't interoperable we generally pick the more sensible behaviour 
to cast in stone.


> Related, we are now discussing what a browser has to accept, and how to 
> process it, but do you separately say what authors should do and what 
> not?

Yes.


> Can you tell me what UAs do if you just paste such an IRI into the 
> address bar? Where do they get the encoding from?

You can try it -- IE encodes the path as UTF-8 and then %-escapes it, and 
then encodes the query component using Windows-1252 (probably the platform 
default encoding) and leaves those bytes raw (unescaped) in the address.

Safari and Mozilla encode both as UTF-8 and %-escape both.

Opera encodes the path as UTF-8 and the query as Windows-1252 (I think) 
and then %-escapes everything.
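
A minimal Python sketch of the two %-escaping behaviours described above 
(UTF-8 for both components, versus UTF-8 for the path and Windows-1252 for 
the query). The function names, the host, and the sample character are 
assumptions chosen for illustration; this is not taken from any browser's 
code, and it does not model the IE case where the query bytes are left raw.

```python
from urllib.parse import quote


def escape_component(text: str, encoding: str) -> str:
    """Encode text to bytes in the given encoding, then %-escape them.

    Hypothetical helper: "/" is left unescaped so it can be used for paths.
    """
    return quote(text, safe="/", encoding=encoding, errors="strict")


def address_bar_url(host: str, path: str, query: str,
                    path_encoding: str, query_encoding: str) -> str:
    """Assemble an http URL, escaping path and query independently."""
    return "http://{}{}?{}".format(
        host,
        escape_component(path, path_encoding),
        escape_component(query, query_encoding),
    )


# Path and query both as UTF-8 (the Safari/Mozilla behaviour described above).
print(address_bar_url("example.org", "/é", "é", "utf-8", "utf-8"))

# Path as UTF-8, query as Windows-1252 (the Opera behaviour described above).
print(address_bar_url("example.org", "/é", "é", "utf-8", "cp1252"))
```

The same character ends up as two escaped bytes in the UTF-8 case and one 
in the Windows-1252 case, which is why the server needs to know which 
encoding the client used.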


> Also, do you realize that a few years ago, many browsers interpreted the 
> path part in the encoding of the page? Thinking things through showed 
> that this wouldn't work (URIs/IRIs have to work on their own, not only 
> in a page context). That's why we wrote the IRI spec, and that's why 
> browsers started to change. The argument of having to work independently 
> from a page applies to query parts, too.

Why did browsers change their behaviour for paths and not query 
components?


> Anybody who wants a query part in an URI to be in a native encoding 
> (because that's what their server-side script expects) can use 
> %-encoding, the same way anybody who wants a path part to be in a native 
> encoding (because that's what their server responds to) already uses 
> %-encoding in their path part.

It's not about what authors will do on new pages. It's about how to handle 
legacy, unmaintained, historical documents. If we break them, we 
(humanity) lose part of our legacy. That would be unfortunate.


> Otherwise, please tell me what the URI is that a client should send
> to the server for an URI such as http://example.org?あいう

If the page encoding is UTF-8:

   /?%E3%81%82%E3%81%84%E3%81%86
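
The escaped bytes above decode, as UTF-8, to the three hiragana あいう. A 
small Python sketch that checks this round trip follows. It assumes the 
query is encoded in the page encoding and then %-escaped; the function 
name and the Shift_JIS comparison are illustrative assumptions, not 
something stated in this thread.

```python
from urllib.parse import quote, unquote

# The escaped query from the example above.
ESCAPED_QUERY = "%E3%81%82%E3%81%84%E3%81%86"


def request_target(query: str, page_encoding: str) -> str:
    """Build the request target for http://example.org?<query>.

    Assumption: the query is encoded in the page encoding, and every
    resulting byte outside the unreserved ASCII set is %-escaped.
    """
    return "/?" + quote(query, safe="", encoding=page_encoding)


# Decoding the escaped bytes as UTF-8 recovers the original characters.
original = unquote(ESCAPED_QUERY, encoding="utf-8")

print(original)
print(request_target(original, "utf-8"))
# Hypothetical non-UTF-8 page, to show the bytes on the wire would differ.
print(request_target(original, "shift_jis"))
```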


On Wed, 25 Jun 2008, Roy T. Fielding wrote:
> >
> > Standards, for the purposes of the HTML5 effort, are comprehensive 
> > documentation intended to make it possible to implement user agents, 
> > and are thus very much not abstractions.
> 
> That is obviously the definition of an implementation specification, not 
> a standard.

Ok. HTML5 is an implementation specification.


> > This isn't intended to disparage other beliefs or opinions as to what 
> > standards should be. I have no problem with standards that, e.g., 
> > leave error handling undefined -- they are just not really relevant to 
> > the HTML5 work.
> 
> At this rate, the feeling will be mutual.  Why don't you just contribute 
> that documentation to the Mozilla website and be done?

Well, originally HTML5 was just being written on the WHATWG site, as a 
collaboration between Mozilla, Opera, and Apple, and with the input of 
several hundred contributors, which I guess is pretty much the same thing 
as just doing documentation on the Mozilla website and being done. 

However, the W3C asked us to do the work in the W3C space. So I guess 
you'd have to ask the W3C team what their reasoning was.


> STD 66 will never be changed to suit those implementations because there 
> are a hundred that do it right for every one that is wrong (and those 
> numbers improve every week as old code disappears).

Ok.


> How an HTML form constructs a query string is entirely defined by HTML.

Forms are a whole different problem; the issue I was raising was only 
related to simple links.


> The contents of href="whatever" are not a URI -- they are characters 
> that are processed as per SGML CDATA (IIRC) to transform it into a 
> sequence of characters in the document character set, which are then 
> considered by the HTML processor as data for the href attribute 
> (whatever that means, it is defined by HTML, not by URI).  If HTML says 
> that the valid data is limited to a URI in the document character set 
> (which is presumably mapped to ASCII when sent outside the DOM), then 
> the data either conforms to STD 66 or it is invalid.
> 
> What the browser does when it sees invalid data is entirely defined by 
> the browser and (sometimes) its configuration.  It has no relevance 
> whatsoever to the URI specification because it is not and never was a 
> URI.  The URI spec defines identifiers, not href attributes.  The only 
> result that matters is that the invalid data is not used by sending it 
> out of the DOM, such as by sending it as an invalid HTTP request.  
> There is no chance that HTML5 will ever exist as a finished document if 
> it requires the sending of invalid HTTP requests as part of its HTML 
> implementation specification.

Well right now the HTML5 spec goes out of its way to avoid sending invalid 
URIs to servers, though that may have to change depending on what existing 
content depends on.

(HTML5 isn't based on SGML, by the way.)


On Wed, 25 Jun 2008, Frank Ellermann wrote:
> > 
> > could you quote the bits that are nonsensical?
> 
> With difficulties, the memo needs ages to load over a V.90 line, and 
> then ages to run some scripts, until my browser asks me if I want to 
> abort whatever it is

Apologies, I should have provided you with a link to the multipage 
version:

   http://whatwg.org/html5

That should be more usable.


> | A URL is a valid URL if at least one of the following
> | conditions holds: 
> | * The URL is a valid URI reference [RFC3986]. 
> 
> Period, end of story, see STD 66.
> 
> | The URL is a valid IRI reference and it has no query component.
> | [RFC3987] 
> 
> Nope, that's an IRI, not an URL (matched in bullet 1).
> 
> | The URL is a valid IRI reference and its query component
> | contains no unescaped non-ASCII characters. [RFC3987] 
> 
> That's also an IRI, not an URL (matched in bullet 1).
> 
> There is also nothing special with query parts using unescaped 
> characters, at least not in RFC 3987.
> 
> | The URL is a valid IRI reference and the character encoding
> | of the URL's Document is UTF-8. [RFC3987] 
> 
> That's also an IRI, not an URL (matched in bullet 1).

I believe the confusion here is that the term "URL" as used in the HTML5 
spec is intended to be a term independent of the term "URL" as used in the 
URI spec. I hadn't realised until recently that the URI spec actually 
defines the term URL as well, and had thought "URL" to be an undefined 
informal term.

I need a term to use in the HTML5 spec which means "a string used to 
identify a resource", and which can then be defined to be valid if it 
matches the conditions listed above. The term has to be one that authors 
would immediately recognise as being intended to be URI-like, yet without 
conflicting with existing definitions. If you disagree with the use of the 
term "URL" for this purpose, do you have any alternative suggestions?


> Maybe use "IRL", the IRI spec. doesn't use it.  Apparently what you 
> really want is a new variant of IRI, with special rules for <iquery> 
> parts in non-UTF-8 documents.

I think people would be more confused by the use of the term "IRL" than 
"URL" (with the exception of people intimately familiar with the URI 
spec). Maybe the term "address" would work?


> > It's true that this is requiring defining things that are at odds with 
> > existing specifications, but that's mostly because those 
> > specifications aren't in fact in line with real usage.
> 
> "Real usage" is not only what numerous broken Web pages do, or what a 
> few browsers guess.  Broken URLs have caused real damage last year:
> 
> http://www.microsoft.com/technet/security/advisory/943521.mspx
> http://www.heise-security.co.uk/news/97878

Right, that's why defining error handling is critical, and why a spec that 
doesn't define error handling is, frankly, irresponsible. By defining 
error handling, we help guarantee that any input results in a known, 
predictable, and most importantly _safe_ behaviour.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Wednesday, 25 June 2008 20:40:44 UTC