Re: URIs in HTML5 and issues arising from Julian Reschke on 2008-06-30 (uri@w3.org from June 2008)

From: Julian Reschke <julian.reschke@gmx.de>
Date: Mon, 30 Jun 2008 07:24:59 +0200
To: Ian Hickson <ian@hixie.ch>
CC: uri@w3.org, HTML WG <public-html@w3.org>
Message-ID: <48686E2B.4020403@gmx.de>

Ian Hickson wrote:
> On Sun, 29 Jun 2008, Julian Reschke wrote:
>> Ian Hickson wrote:
>>>> Fair enough.  Use "HTML URL" a few times, then, particularly in the 
>>>> context of the definition of validity.
>>> It was pointed out that "HTML URL" would also be misleading, since 
>>> there are already spec writers looking to use these definitions 
>>> elsewhere.
>> Not sure why this means it can't be called "HTML URL".
> 
> Because it would be even more confusing to have non-HTML specs talk about 
> their URLs being HTML URLs.

First I'd like to see which specs are affected.

>>>> Interesting. If so that's a flat-out browser bug and should be 
>>>> fixed.
>>> That's nice in theory, but content depends on this behaviour now.
>> How much? It would be nice to make this decision based on reliable 
>> information, because it's an expensive one for the future.
> 
> Philip has already posted numbers, and I cited them in the e-mail to which 
> you replied. If you're not going to do research yourself, the least you 
> could do is read the e-mails to which you are replying completely before 
> asking that other people do the research for you.

Here I was asking for a different number (the number of pages that use 
non-ASCII characters in queries which *also* cannot be represented in 
the document's encoding).

>>> Having had to deal with content in mixed encodings before, I disagree 
>>> that it's better. At least with data loss you get much quicker 
>>> feedback that something went wrong.
>> How do you know that it is data loss?
> 
> It's pretty obvious when you paste a URL into a document and then click it 
> to see if it worked that it didn't work if it goes to a page you're not 
> expecting, all the more so when it does so because the data got converted 
> into question marks.

With question marks, there will be data loss. You may or may not notice 
it, because the page you get may look ok (it depends on how important 
that part of the query was). If you notice that something is wrong, then, 
yes, spotting the question mark may help, but only if you understand the 
issue itself. For how many users is that the case?

With UTF-8/percent-escaping, the page may very well work as desired, 
because the server happens to understand that encoding (see the Google 
case cited in the WebKit bug report).
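
To illustrate the two behaviours, here is a minimal Python sketch. The 
function names and the sample queries are made up for illustration, and 
the choice to leave "?" unescaped is an assumption about how the 
substituted character ends up on the wire; actual browser behaviour may 
differ in detail.

```python
from urllib.parse import quote


def encode_query_legacy(query: str, document_encoding: str) -> str:
    """Encode a query in the document's encoding, then percent-escape.

    Characters the encoding cannot represent are replaced by "?",
    which is the lossy behaviour discussed above.
    """
    raw = query.encode(document_encoding, errors="replace")
    # Keep "=", "&" and the substituted "?" literal; escape other bytes.
    return quote(raw, safe="=&?")


def encode_query_utf8(query: str) -> str:
    """Encode a query as UTF-8, then percent-escape (IRI-style)."""
    return quote(query.encode("utf-8"), safe="=&")


# Representable in ISO-8859-1: no loss, but a single-byte escape.
latin1_ok = encode_query_legacy("q=café", "iso-8859-1")
# Not representable in ISO-8859-1: the value degrades to question marks.
latin1_lost = encode_query_legacy("q=日本", "iso-8859-1")
# UTF-8 keeps the value intact regardless of the document's encoding.
utf8 = encode_query_utf8("q=日本")

print(latin1_ok)    # q=caf%E9
print(latin1_lost)  # q=??
print(utf8)         # q=%E6%97%A5%E6%9C%AC
```

In the lossy case nothing in "q=??" lets the server recover the 
original characters, whereas the UTF-8 form can be decoded by any server 
that expects UTF-8.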

Finally, if you copy & paste the URL, you wouldn't see the replacement 
characters anyway, right? In that case the default handling (using 
UTF-8) would apply, which is all the more reason to consider making it 
mandatory (because otherwise following the link inside the document and 
the copy/paste case yield different results).
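
The mismatch between the two routes can be sketched in Python as 
follows. This assumes a document in ISO-8859-1, a pasted URL handled as 
UTF-8, and a server that decodes queries as UTF-8; the helper names and 
the sample value are hypothetical.

```python
from urllib.parse import quote, unquote


def request_query(value: str, encoding: str) -> str:
    """Percent-escape a query value after encoding it in `encoding`."""
    return quote(value.encode(encoding, errors="replace"), safe="?")


def server_decode(query: str) -> str:
    """Decode a query the way a UTF-8-only server would (an assumption)."""
    return unquote(query, encoding="utf-8", errors="replace")


value = "café"
clicked = request_query(value, "iso-8859-1")  # link followed in the document
pasted = request_query(value, "utf-8")        # same URL pasted elsewhere

print(clicked, pasted)         # caf%E9 caf%C3%A9
print(server_decode(pasted))   # café
print(server_decode(clicked))  # caf followed by U+FFFD
```

The same visible link thus produces two different requests, and only 
one of them round-trips on a server that assumes UTF-8.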

>> I care because I'd like to see documents using non-ASCII characters in 
>> query parts become compliant no matter what encoding they are in.
> 
> Unless we change the definition of HTML5's URLs to be conforming even when 
> those URLs would not be treated as IRIs, I don't see any way to get there 
> from here.

We could break the affected pages and/or add a mechanism through which 
pages can opt in to the sane UTF-8-based behavior.

>> Whether or not RFC 3986 defines "URL" is really not the point. If it 
>> didn't, another, earlier RFC would.
> 
> Terminology defined in obsolete RFCs would be even less of an issue 
> though.

Oh my. In that case these RFCs wouldn't be "obsolete", by definition.

> ...
>> [asking vendors]
>>
>> It would be nice to see these kinds of discussions being part of the 
>> working group process, so that the other WG members can actually see 
>> what was being proposed, and what the answer was.
> 
> The HTMLWG is only a small part of the broad range of places from which I 
> take input, which includes hundreds of blogs, at least three separate bug 
> systems, multiple other mailing lists, face to face discussions, IRC 
> conversations on dozens of channels and privately, private e-mails, etc. I 
> try to keep as much of the discussions to the HTMLWG and WHATWG lists, but 
> the sheer volume of traffic that would be generated by archiving all the 
> sources of input on public-html would be staggering, and that's without 
> even considering whether all those people would actually be willing to 
> have their input forwarded in that way.

In which case it seems to me we have a big process problem.

BR, Julian
Received on Monday, 30 June 2008 05:25:44 UTC