RE: HTML5 URL vs. IRI vs. URI... from Phillips, Addison on 2008-08-22 (public-html@w3.org from August 2008)

From: Phillips, Addison <addison@amazon.com>
Date: Fri, 22 Aug 2008 09:27:38 -0700
To: Julian Reschke <julian.reschke@gmx.de>
CC: "'HTML WG'" <public-html@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>
Message-ID: <4D25F22093241741BC1D0EEBC2DBB1DA014B0B32FF@EX-SEA5-D.ant.amazon.com>

> >
> > This definition isn't quite complete. "Non-ASCII characters" can
> > be escaped in lots of ways using a wide variety of character
> > encodings. You should mention the use of UTF-8 to escape non-ASCII
> > characters or (better?) reference section 3.1 of IRI (3987).
> 
> > ...
> 
> Although I still disagree with how HTML5 introduces this, I think *this*
> part is incorrect. It *does* allow query parameters to be encoded using,
> for instance, ISO-8859-1, and it needs to do this for compatibility with
> existing content.
> 

I have to admit that I read Section 2.5.1 of [1] as trying to define a 'valid URL' to mean (essentially) an IRI in which the full string uses the UTF-8 encoding. Although I'm very much aware of the 'query part encoding issue', I certainly didn't read the section that way. In fact, I was kind of pleased to think that 'valid URL' meant "IRI from end-to-end". Presumably "invalid URLs" could use non-UTF-8 encoded query parts, which *are* handled by the processing instructions.

If, as you suggest, the third bullet point instead means to allow the query part to use any character encoding (e.g. the document encoding), then the section is problematic. 

1. The phrase "non-ASCII characters" isn't correct. What is meant is that there are no unescaped characters outside the 'unreserved' production. (See RFC 3986; and yes, I mean to cite the URI production here.) ASCII covers more characters than are permitted in the query portion of *any* URI. For example, the character '#' is both an ASCII character and one that terminates a query part.

2. The bullet point probably doesn't mean characters anyway. If we intend to allow non-UTF-8 encodings in the query part of a "valid URL", then we should be clear and say 'byte values'. Those byte values may represent characters in some encoding (such as the document's encoding), but both URI and IRI also permit the transmission of non-character data.
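
To make the two points above concrete, here is a small sketch using Python's standard urllib.parse module. The example.com URL and the variable names are made up for illustration; nothing here comes from the HTML5 draft itself.

```python
from urllib.parse import quote, urlsplit

# Point 1: '#' is an ASCII character, yet it ends the query component.
parts = urlsplit("http://example.com/search?q=a#b")
query = parts.query        # everything between '?' and '#'
fragment = parts.fragment  # everything after '#'

# Point 2: the same character produces different escaped bytes
# depending on which encoding is used before percent-escaping.
utf8_escaped = quote("é", safe="", encoding="utf-8")
latin1_escaped = quote("é", safe="", encoding="iso-8859-1")

print(query, fragment)               # q=a b
print(utf8_escaped, latin1_escaped)  # %C3%A9 %E9
```

A consumer that sees "%E9" in a query cannot tell from the escape alone which character was intended, which is why talking about byte values rather than characters seems safer here.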

So, assuming that 'valid URL' doesn't mean "IRI", the bullet point should read something like:

"The URL is a valid IRI reference [RFC 3987] and its query component contains no unescaped bytes outside the 'unreserved' production in URI [RFC 3986]."
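
As a rough illustration of the query half of that wording, the following sketch checks a query component, taken as raw bytes, for anything that is neither an 'unreserved' character nor part of a %HH escape. The helper name query_is_escaped is hypothetical, the wording is taken literally, and the "valid IRI reference" half is not checked at all.

```python
import re

# unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"  (RFC 3986, section 2.3)
# A percent sign is accepted only as the start of a "%HH" escape.
_QUERY_OK = re.compile(rb"(?:[A-Za-z0-9._~-]|%[0-9A-Fa-f]{2})*")


def query_is_escaped(query: bytes) -> bool:
    """Hypothetical helper: True if every byte is unreserved or inside a %HH escape."""
    return _QUERY_OK.fullmatch(query) is not None


print(query_is_escaped(b"caf%C3%A9"))               # True  (UTF-8 escape)
print(query_is_escaped(b"caf%E9"))                  # True  (ISO-8859-1 escape)
print(query_is_escaped("café".encode("latin-1")))   # False (raw 0xE9 byte)
print(query_is_escaped(b"a#b"))                     # False ('#' is ASCII but not unreserved)
```

Both escaped forms pass regardless of the encoding behind them, which matches the intent of talking about bytes. Read this literally, though, and the check also rejects '=' and '&', which RFC 3986 permits in a query as sub-delims, so real wording would presumably need to allow those too.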

Addison

[1] http://www.w3.org/html/wg/html5/#urls

Received on Friday, 22 August 2008 16:28:23 UTC