Re: the "HTML URL" issue, was: Why Microsoft's authoritative=true won't work and is a bad idea

On Jul 8, 2008, at 12:27 AM, Julian Reschke wrote:
> Henrik Nordstrom wrote:
>> On mån, 2008-07-07 at 18:56 -0400, Justin James wrote:
>>> The problem with the concept of HTML specifying its own URLs, from my
>>> viewpoint, is that developers need one standard to follow, not 3 (URI,
>>> IRI, HTTP URL).
>> But I am still not aware of the problem which triggered this. I linger
>> on the HTTP WG, not the HTML one.. and is therefore unaware of what
>> problem HTTP URL/URI/IRI specifications cause for HTML.
>> ...
>
> See thread at <http://lists.w3.org/Archives/Public/uri/2008Jun/0088.html>.
>
> Key issues:
>
> 1) there are non-IRI identifiers in HTML in use (such as using space
> characters)

No, there aren't.  The content of the attribute value is CDATA, not an
IRI.  How the parser converts the CDATA to a URI string (not IRI string)
should be defined by HTML.  The algorithm doesn't even need to be the same
for different element attributes (e.g., some attribute values consist of
space-separated references).  The value doesn't become identifier(s)
until after the conversion of the CDATA string to a valid URI is complete.
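
As a rough illustration of such a transform (a sketch only, not anything
HTML currently defines): the function names below are hypothetical, and
the choices to trim surrounding whitespace and to percent-encode
disallowed characters as UTF-8 are assumptions made for the example.

```python
from typing import List
from urllib.parse import quote

# Characters left untouched: the URI reserved set plus "%", so existing
# percent-escapes are not double-encoded. quote() already leaves ASCII
# letters, digits, and "-._~" alone.
URI_SAFE = ":/?#[]@!$&'()*+,;=%"


def attribute_to_uri(raw: str) -> str:
    """Hypothetical transform for an attribute holding one reference.

    Assumption: surrounding whitespace is dropped and every character
    outside the URI character set is percent-encoded as UTF-8.
    """
    return quote(raw.strip(), safe=URI_SAFE)


def attribute_to_uris(raw: str) -> List[str]:
    """Hypothetical transform for an attribute holding a
    space-separated list of references: split first, then convert."""
    return [quote(part, safe=URI_SAFE) for part in raw.split()]


if __name__ == "__main__":
    print(attribute_to_uri(" http://example.com/a b "))
    print(attribute_to_uris("one.html two.html"))
```

The two functions differ only in how they read the raw CDATA, which is
the point: the same string is one reference under one rule and two
references under the other, and nothing is an identifier until the rule
has been applied.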

> 2) UAs do not use UTF-8 consistently when mapping non-ASCII characters
> in query parameters (they may use the document encoding instead)

That's because UTF-8 was not a desired mapping when HTML was defined.
That's why HTML maps query parameters to the document encoding.
I don't see why this is even being argued, since it certainly won't
be changing any time soon.  It makes far more sense to encourage the
use of UTF-8 document encodings.
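
To make the difference concrete, here is a small sketch of the same
query value percent-encoded under two document encodings.  The helper
name is hypothetical, and real user agents have additional rules (for
example, for characters the document encoding cannot represent) that
this does not attempt to model.

```python
from urllib.parse import quote


def encode_query_value(value: str, document_encoding: str) -> str:
    """Hypothetical helper: percent-encode a query value using the
    bytes of the given document encoding.

    Assumption: characters that the encoding cannot represent are not
    handled here; quote() raises UnicodeEncodeError for them.
    """
    return quote(value, safe="", encoding=document_encoding)


if __name__ == "__main__":
    # Same character, different bytes on the wire.
    print(encode_query_value("caf\u00e9", "iso-8859-1"))
    print(encode_query_value("caf\u00e9", "utf-8"))
```

With a UTF-8 document the two mappings coincide, which is why
encouraging UTF-8 document encodings sidesteps the mismatch.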

> 3) there is no defined error handling in URI/IRI (I do not agree that
> this is a problem with URI/IRI)

Of course not, just as there is no defined error handling for the name
on your birth certificate.  Error handling is always defined by context.

> 1) and 2) can be solved by defining a transformation from HTML URL to
> IRI. HTML5 currently modifies the parsing rules of IRI instead, which I
> think is the wrong approach.

The whole discussion is just brain dead.  All of the supposed issues
are about translating raw data into standardized form.  Instead
of simply defining the transform of raw attribute to standardized value,
which is entirely governed by HTML, the editor has chosen to treat the
raw value as some sort of magic final form, reuses the well-known URL
moniker in the most asinine way, and blames the other standards
(which he thankfully has no control over) for not supporting all of the
possible crappy raw data that could be input in an HTML attribute.

We know that just anything is not interoperable.  That's why URI is
limited to a fairly small set of characters and a simple syntax: to
require WWW identifiers to be in a form that is usable worldwide.
That's why HTTP identifiers are limited to URIs.  That's why this
whole discussion about creating new identifiers and new protocols in
HTML is a total waste of time -- the rest of the world does not want
it and will not allow it to be published as HTML5.  Pound the sand
all you like; the network standards will not change because they are
designed to support everyone's needs, not just the selfish desires of
a very small set of browser developers.

....Roy

Received on Tuesday, 8 July 2008 22:13:18 UTC