URIs in HTML5 and issues arising

On Wed, 25 Jun 2008, John Cowan wrote:
>
> Ian Hickson scripsit:
> 
> > I think that the confusion of introducing a new term would be greater 
> > than the confusion of reusing URL. People intuitively know what a 
> > "URL" basically is, and they know what an "address" is. They don't 
> > know what an "HRI" is and I think that would work against the 
> > readability of the spec. (This is especially important for this 
> > particular term since it appears all over the place in HTML5.)
> 
> Fair enough.  Use "HTML URL" a few times, then, particularly in the 
> context of the definition of validity.

It was pointed out that "HTML URL" would also be misleading, since there 
are already spec writers looking to use these definitions elsewhere.


On Thu, 26 Jun 2008, Felix Sasaki wrote:
> >
> > It was brought to my attention on IRC that "address" is probably as 
> > overloaded as "URL" so this might not be a step forwards for the spec, 
> > just a step sideways. I'll see what can be done though. It might be 
> > that the spec just uses the term "URL" and ignores the URI spec's 
> > definition of the term.
> 
> There is an alternative to ignoring the URI spec's definition: describe 
> your usage of "URL" and the usage as intended by the URI spec. See a 
> similar problem and a solution for the usage of the terms "URI" and 
> "IRI" mentioned at 
> http://lists.w3.org/Archives/Public/www-tag/2008Jun/0110.html

I've included a note pointing to the distinction. Thanks for the 
suggestion.


On Thu, 26 Jun 2008, Frank Ellermann wrote:
> 
> Intentionally munging IRI and URI is bad for URI consumers with no clue 
> what to do with an IRI.  The <ihost> part is not trivial.  Even the XML 
> 1 spec. got it wrong, ending up with percent-encoded gibberish for a 
> <host>.  This breaks existing software expecting *real* URIs, not IRIs.

This isn't really relevant to HTML5, which basically allows IRIs in the 
content, and converts everything to URIs before sending them to other 
hosts (with possibly some future exceptions to do with incorrect use of 
the "%" character).


> URL = IRI is newspeak. Why not use the term IRI for IRIs, if that's what 
> it is ?

Because most authors have no idea what an IRI is, and it would be far more 
confusing than the term "URL".


> IRIs are cute for software supporting them.  But authors reading the 
> HTML5 memo need to be aware that this is not everywhere the case.  
> Starting with the W3C validator as popular "legacy" software.

HTML5-compliant UAs will support IRIs, by definition.


On Fri, 27 Jun 2008, Elliotte Harold wrote:
> > 
> > The second is with IRIs and character encodings other than UTF-8. 
> > While browsers reliably encode non-ASCII characters in the path using 
> > UTF-8, non-ASCII characters in the query component are encoded using 
> > the document's character encoding, and not UTF-8, which is 
> > incompatible with how the IRI spec defines things.
> 
> You mean, for instance, when submitting a form using GET?

No, I mean in a regular link.


> Interesting. If so that's a flat-out browser bug and should be fixed.

That's nice in theory, but content depends on this behaviour now.


> Allowing a multiplicity of encodings in URLs is a recipe for 
> interoperability disaster.

We're well past that point.


On Fri, 27 Jun 2008, Elliotte Harold wrote:
> 
> 1. All numeric character references should be considered to point to 
> Unicode code points.

Yes, this is the case.


> 2. All percent escapes in documents should be considered to refer to 
> UTF-8 bytes.

There are too many servers that disagree with this to really get there.


> 3. The browser should convert all IRIs to pure URIs using exclusively 
> UTF-8 percent encoding as specified in the IRI spec.

Due to the problems with 2, this isn't really an option.


> 4. If this fails because the UTF-8 in step 2 is ill-formed, redo step 2
> assuming the encoding is ISO-8859-1 and pray.

That doesn't work in non-Western locales.


> Any scheme that attempts to replicate existing browser URL-encoding 
> behavior is doomed to failure, and will simply relegate us to ASCII only 
> URIs for the foreseeable future.

I'm not sure how you're defining failure. If we don't replicate what 
browsers do, then the browsers will ignore the spec. That's what I 
consider a failure.


> Absent an encoding declaration, there's just no alternative to 
> specifying a single uniform encoding for all URIs.

Existing practice disagrees. :-)


On Fri, 27 Jun 2008, Elliotte Harold wrote:
> > 
> > That's what HTML5 did until about a week ago. The problem is that 
> > doing so leaves a number of behaviours undefined, as far as I can 
> > tell. For example, what should following the link in this example do, 
> > in terms of the actual URI passed to the networking layer?
> > 
> >    <!DOCTYPE HTML>
> >    <title>Test</title>
> >    <meta charset="ISO-8859-13">
> >    <p><a href="results.cgi/&#x017d;?&#x017d;">Test</a>
> 
> Well, I can't do UTF-8 calculations in my head, but assuming that 
> Unicode character 0x017D encodes in three bytes which are AA, BB, and 
> CC, then it should request this path:
> 
> results.cgi/%AA%BB%CC?%AA%BB%CC
> 
> The meta tag is *NOT* considered here.

Legacy Web content disagrees.
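
For illustration only, here is a minimal Python sketch of the split described earlier in this thread (path escaped as UTF-8, query escaped in the document's encoding), applied to the example above. The helper name `escape_link` is hypothetical, and this approximates the legacy behaviour rather than reproducing the spec's algorithm. U+017D is in fact two bytes in UTF-8 (C5 BD), not three.

```python
from urllib.parse import quote


def escape_link(href: str, document_encoding: str) -> str:
    """Hypothetical helper: escape the path as UTF-8 and the query
    in the document's character encoding."""
    path, sep, query = href.partition("?")
    escaped_path = quote(path, safe="/", encoding="utf-8")
    # errors="replace" turns unencodable characters into "?" bytes,
    # which quote() then escapes; the example below never hits that case.
    escaped_query = quote(query, safe="=&", encoding=document_encoding,
                          errors="replace")
    return escaped_path + sep + escaped_query


href = "results.cgi/\u017d?\u017d"

# What legacy content appears to rely on: the query follows <meta charset>.
legacy = escape_link(href, "iso8859_13")

# What the IRI spec asks for: UTF-8 everywhere.
iri_style = escape_link(href, "utf-8")

print(legacy)     # results.cgi/%C5%BD?%DE
print(iri_style)  # results.cgi/%C5%BD?%C5%BD
```

The two results differ only after the "?", which is the part existing servers have come to depend on.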


On Fri, 27 Jun 2008, Julian Reschke wrote:
> 
> I'm very concerned with the brokenness of the query component; IE 
> creates invalid HTTP URLs, thus essentially forcing HTTP servers to 
> support those.

The spec requires %-escaping here, which luckily is treated by servers the 
same way as raw octets, so it all works out.


> But even when the document encoding is percent-escaped, there's still an 
> issue when a character in the input "URL" can not be mapped to the 
> document encoding; it would be nice to have a test case for that (or do 
> we?).

The spec says to use a single question mark for those characters. (It's an 
error condition if this happens, anyway, so authors writing conforming 
documents can't get to this point.)
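
As a rough sketch of that error handling (an illustration, not the spec's algorithm), Python's "replace" error handler behaves the same way when encoding: a character with no mapping in the target encoding becomes a single "?" byte. The function name `encode_query` is hypothetical.

```python
def encode_query(query: str, document_encoding: str) -> bytes:
    """Hypothetical helper: encode a query string in the document's
    encoding, substituting "?" for characters it cannot represent."""
    return query.encode(document_encoding, errors="replace")


# U+017D has no mapping in ISO-8859-1, so it is lost.
lossy = encode_query("q=\u017d", "iso8859_1")

# The same character is representable in ISO-8859-13, so it survives.
intact = encode_query("q=\u017d", "iso8859_13")

print(lossy)   # b'q=?'
print(intact)  # b'q=\xde'
```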


> I'd really like to see HTML5 help getting us to a situation where 
> everything is percent-escaped UTF-8. Basing it just on the document 
> encoding seems to be fragile...

Very fragile, yes. I don't see how to get there from here though.


On Fri, 27 Jun 2008, Philip Taylor wrote:
> 
> We also really shouldn't break existing sites that work perfectly well 
> in current web browsers. E.g. http://www.yildizburo.com.tr/ says
> 
>   <a href="urunlist.php?tur=FAX MAKİNALARI&kategori=Laser Fax" class="textmenu">Laser Fax</a>
> 
> encoded in Windows-1254. Clicking that link, Firefox/Opera/Safari go to
> 
>   urunlist.php?tur=FAX%20MAK%DDNALARI&kategori=Laser%20Fax
> 
> while IE goes to
> 
>   urunlist.php?tur=FAX%20MAKİNALARI&kategori=Laser%20Fax
> 
> where the İ is a raw 0xDD byte. Both variations load the correct page.
> 
> Using UTF-8, i.e.
> 
>   urunlist.php?tur=FAX%20MAK%C4%B0NALARI&kategori=Laser%20Fax
> 
> returns a page with no data, which is bad.

Indeed.
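
The two escaped forms quoted above can be reproduced with a short Python sketch. The helper name `escape_query` is hypothetical, and "cp1254" is Python's codec name for Windows-1254; this illustrates the byte-level difference only, not how any browser is implemented.

```python
from urllib.parse import quote

QUERY = "tur=FAX MAK\u0130NALARI&kategori=Laser Fax"


def escape_query(query: str, encoding: str) -> str:
    """Hypothetical helper: percent-escape a query string after
    encoding it with the given character encoding."""
    return quote(query, safe="=&", encoding=encoding)


as_document_encoding = escape_query(QUERY, "cp1254")
as_utf8 = escape_query(QUERY, "utf-8")

print(as_document_encoding)  # tur=FAX%20MAK%DDNALARI&kategori=Laser%20Fax
print(as_utf8)               # tur=FAX%20MAK%C4%B0NALARI&kategori=Laser%20Fax
```

Only the first form matches what the site's server expects.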


> Looking at random pages listed in dmoz.org (which seems quite biased 
> towards English sites), something like 0.5% have non-ASCII characters in 
> <a href> query strings, and (judging by eye) maybe half of those are not 
> UTF-8, so it's a widespread issue and there's no chance of fixing all 
> those sites.

Wow, that's remarkably high.


On Sat, 28 Jun 2008, Julian Reschke wrote:
> Ian Hickson wrote:
> > Actually, while this applies to forms (and WF2 mentions it), it 
> > doesn't seem to apply to regular links, where unencodable characters 
> > just get turned into question marks by IE and Opera. Safari and 
> > Mozilla each do their own thing (&-escape and use UTF-8 respectively) 
> > so I've gone with the more interoperable IE/Opera behaviour in the 
> > spec.
> 
> According to 
> <http://lists.w3.org/Archives/Public/public-html/2008Jun/0358.html>, 
> Safari 3 uses question marks.

According to:

   http://hixie.ch/tests/adhoc/uri/encoding/017.html

Safari trunk uses &-escaping.


> I would think that both data loss (IE/Safari/Opera) and what you call 
> "data corruption" (FF) are bad. As a matter of fact, the latter may be 
> less harmful as servers can try first UTF-8, then document encoding (and 
> I know some servers already do that).

Having had to deal with content in mixed encodings before, I disagree that 
it's better. At least with data loss you get much quicker feedback that 
something went wrong.
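
For reference, the server-side fallback mentioned in the quoted text (try UTF-8 first, then the document encoding) might look roughly like the sketch below. The function name `decode_query_value` is hypothetical, and the server is assumed to know which legacy encoding its pages use.

```python
from urllib.parse import unquote_to_bytes


def decode_query_value(raw: str, legacy_encoding: str) -> str:
    """Hypothetical helper: decode percent-escaped bytes as UTF-8,
    falling back to a known legacy encoding if that fails."""
    data = unquote_to_bytes(raw)
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode(legacy_encoding, errors="replace")


from_legacy = decode_query_value("MAK%DDNALARI", "cp1254")
from_utf8 = decode_query_value("MAK%C4%B0NALARI", "cp1254")

print(from_legacy == from_utf8)  # True
```

The guess is not always safe: a byte sequence that is valid in both encodings is silently read as UTF-8, which is the mixed-encoding situation described above.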


> On the other hand, documenting something that is clearly broken seems to 
> be the wrong approach to me, in particular as we have proof that there 
> currently isn't any reliable interoperability for this edge case.

This is error handling (this can't happen for conforming documents), so 
I'm surprised that you have an opinion as to what should happen. :-)

Turning these characters into question marks seems better than the 
alternatives to me. It also matches what IE does, which is usually a good 
sign too.


> It would be interesting to know how many pages out there contain 
> characters in query parts of links that aren't part of the document 
> encoding. Only these would be broken if the more sane FF approach were 
> used (and these pages may *already* be broken in FF as of today).

Such pages are presumably broken in all browsers today.


On Sat, 28 Jun 2008, Brian Smith wrote:
> > >
> > > Instead, the term "IRI" should be used throughout, except where only 
> > > URIs are allowed. In addition, whenever non-URI IRIs are forbidden, 
> > > there should be an explanation of why they are forbidden.
> > 
> > Since the way that these values are treated doesn't actually follow 
> > IRI rules, I've used the term "URL" instead.
> 
> Don't you see how nonsensical that statement is? What you call a "URL" 
> doesn't follow the rules for URLs either. The "U" in "URL" stands for 
> "uniform" yet this draft demands that we process "URL"s in HTML 
> differently from the way they are processed according to any other 
> specification and differently from the way they are processed by the 
> vast majority of software.

While the term URL may have a meaning given to it by RFC3986, that's not 
the meaning most authors associate with the term. In fact, I think what 
HTML5 now defines it to be is closer to what authors think of URLs as. 
This makes HTML5 easier to read, which is a good thing.


> A specification should be clearly written. Redefining terminology that 
> is already well-known by the reader is confusing and counterproductive.

I agree. However, in this case I don't believe "URL" as per RFC3986 is 
"well known". I think "URL" as per HTML5 is what it is most commonly 
assumed to mean.


> Coining a new term (e.g. "HTML Resource Locator"/"HRL") for what you are 
> trying to do will result in a clearer specification.

I doubt it.


> When somebody sees "URL" they think "Hey, I already know what a URL is." 
> Whenever *I* see "URL" I think "Why can't I use an IRI here?" When 
> somebody sees "HRL" they will think "WTF is a HRL?" which will motivate 
> them to read the definition.

I think you (and others on the uri@w3.org list) are the exception here 
rather than the rule. Even I didn't know that RFC3986 defined "URL" until 
recently, and I read RFCs for a living.



On Sat, 28 Jun 2008, Smylers wrote:
> 
> But most people's concept of precisely what constitutes a URL is pretty 
> fuzzy.  It isn't clear that what they think of on reading "URL" matches 
> the existing definition but not the HTML 5 one.  The nuances between 
> those definitions probably don't even register with many people, meaning 
> that the change doesn't affect them: their general idea of what a URL is 
> matches the HTML 5 definition just as closely as it does the original.

Indeed.


> And I'd've thought that for many people "URL" simply means "the internet 
> address you can type in a web browser" (since this is by far the most 
> common situation in which people encounter URLs) -- in which case, their 
> beliefs about what a URL is come from browser behaviour, not a spec. 
> For these people the HTML 5 definition is actually an improvement, since 
> it will result in the spec matching their existing beliefs!

I agree.


On Sat, 28 Jun 2008, Julian Reschke wrote:
> 
> That's true, but those people certainly are not the intended audience 
> for this spec, right?

They are certainly a big part of the intended audience.


On Sat, 28 Jun 2008, Justin James wrote:
> 
> I have not seen a *single* person on this list say, "hey, this is an 
> important distinction at a functional level". Every person involved here 
> (Brian Smith, Julian Reschke, Smylers, Mark Baker, Philip Taylor, 
> myself) all agree that the distinction is meaningless except in one 
> extremely narrow use case: people with an intimate knowledge of the 
> URL/IRI/URI spec(s) who are implementing something in which the 
> distinction is important.
> 
> I posit that this use case is irrelevantly small; it only seems to apply 
> to people attempting to write applications that implement a particular 
> spec, or maybe people writing an "URIBuilder" type library component or 
> something.

Agreed.


> To "real world" people, this is Yet Another Spec That Shall Be Ignored. 
> By trying to find some way to have all of these slightly different items 
> play nicely with each other, we're dancing around the elephant in the 
> room (I know, Managerial Speak) which is that there should only be one 
> *RI/L spec. PERIOD.

Well, that would be nice, but doesn't appear to be an option, hence HTML5 
taking matters into its own hands, as it were. :-)


> So let's stop this silly dance, get with the *RI/L group, and tell them, 
> "this is broken, please provide us with 1 unified spec that makes 
> sense." But for us to keep trying to Band-Aid the broken *RI/L situation 
> within the HTML spec itself is pretty pointless. *RI/L is meta to HTML, 
> and not within our purview.

I agree.


On Sun, 29 Jun 2008, Julian Reschke wrote:
> > 
> > If valid HTML5 URLs and valid IRIs are equivalent, and invalid HTML5 
> > URLs and invalid IRIs are indistinguishable, then what's the problem?
>
> Valid HTML5 URLs are IRIs.
> 
> Invalid HTML5 URLs get special treatment in the spec (note I'm not 
> arguing against that treatment). The confusion comes from the fact that 
> when the spec says "URL" it really means any URL, not only valid ones.

I don't understand why that would be confusing, except for theoreticians 
who consider invalid documents to not be relevant. I don't think most 
people think "oh, if it's not valid it's not a URL".


> > > Understood. The alternative (which I think should be seriously 
> > > considered) is to break those pages, and to always use UTF-8 for 
> > > encoding.
> > 
> > I can only spec that if browsers are willing to do it. So far, my 
> > understanding is that they are not. (There's no point writing a spec 
> > that isn't followed, the whole point of the spec is to define what 
> > should happen to get interoperability.)
> 
> Understood again, but maybe it makes sense to ask the question again, 
> now that all browser vendors are actually part of the same specification 
> effort.

I've asked the question and been given a negative response. I encourage 
you to follow up on this and see if you can get a better response.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Sunday, 29 June 2008 10:58:22 UTC