Re: URIs in HTML5 and issues arising

From: Julian Reschke <julian.reschke@gmx.de>
Date: Sun, 29 Jun 2008 13:29:21 +0200
Message-ID: <48677211.5090204@gmx.de>
To: Ian Hickson <ian@hixie.ch>
CC: uri@w3.org, HTML WG <public-html@w3.org>

Ian Hickson wrote:
>> Fair enough.  Use "HTML URL" a few times, then, particularly in the 
>> context of the definition of validity.
> 
> It was pointed out that "HTML URL" would also be misleading, since there 
> are already spec writers looking to use these definitions elsewhere.

Not sure why this means it can't be called "HTML URL".

> ...
>> Interesting. If so that's a flat-out browser bug and should be fixed.
> 
> That's nice in theory, but content depends on this behaviour now.

How much? It would be nice to make this decision based on reliable 
information, because it's an expensive one for the future.

> ...
>> According to 
>> <http://lists.w3.org/Archives/Public/public-html/2008Jun/0358.html>, 
>> Safari 3 uses question marks.
> 
> According to:
> 
>    http://hixie.ch/tests/adhoc/uri/encoding/017.html
> 
> Safari trunk uses &-escaping.

The latest release version uses question marks, so the trunk behaviour 
appears to be a regression.
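
To make the behaviours under discussion concrete, here is a minimal sketch 
in Python of what each one does to a query value containing a character 
that the document encoding cannot represent. The function name 
`encode_query_value` and the strategy labels are hypothetical, chosen for 
illustration only. Reading the Firefox behaviour as "send UTF-8" is an 
assumption drawn from this thread, and percent-encoding every reserved 
character is a simplification, not a description of any browser's exact 
output.

```python
from urllib.parse import quote


def encode_query_value(value: str, doc_encoding: str, strategy: str) -> str:
    """Percent-encode a query value under one of three illustrative strategies.

    The strategy names are hypothetical labels for the behaviours mentioned
    in the thread; they are not taken from any specification.
    """
    if strategy == "question-mark":
        # Unrepresentable characters become "?" (the "data loss" case).
        raw = value.encode(doc_encoding, errors="replace")
    elif strategy == "ncr":
        # Unrepresentable characters become "&#NNNN;" ("&-escaping").
        raw = value.encode(doc_encoding, errors="xmlcharrefreplace")
    elif strategy == "utf-8":
        # Ignore the document encoding and always send UTF-8 (assumed here
        # to approximate the Firefox behaviour described in the thread).
        raw = value.encode("utf-8")
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Simplification: percent-encode everything outside the unreserved set.
    return quote(raw, safe="")


if __name__ == "__main__":
    for name in ("question-mark", "ncr", "utf-8"):
        print(name, encode_query_value("x\u4e2d", "iso-8859-1", name))
```

With the question-mark strategy, the result for U+4E2D in an ISO-8859-1 
document is the same as for a value in which the author typed a literal 
"?", so the receiving server cannot tell the two apart.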

>> I would think that both data loss (IE/Safari/Opera) and what you call 
>> "data corruption" (FF) are bad. As a matter of fact, the latter may be 
>> less harmful as servers can try first UTF-8, then document encoding (and 
>> I know some servers already do that).
> 
> Having had to deal with content in mixed encodings before, I disagree that 
> it's better. At least with data loss you get much quicker feedback that 
> something went wrong.

How do you know that it is data loss?
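
The server-side fallback mentioned above (try UTF-8 first, then the 
document encoding) can be sketched roughly as follows. This is only an 
illustration under stated assumptions: `decode_query_value` is a 
hypothetical helper, and the server is assumed to know the encoding of the 
document that produced the link.

```python
from urllib.parse import unquote_to_bytes


def decode_query_value(raw: str, doc_encoding: str) -> str:
    """Decode a percent-encoded query value: UTF-8 first, then a fallback.

    `doc_encoding` is the encoding the server assumes the linking document
    used; how the server learns it is outside the scope of this sketch.
    """
    data = unquote_to_bytes(raw)
    try:
        # Strict decoding: bytes that are not well-formed UTF-8 raise here.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Fall back to the document encoding; undecodable bytes become U+FFFD.
        return data.decode(doc_encoding, errors="replace")


if __name__ == "__main__":
    print(decode_query_value("caf%C3%A9", "iso-8859-1"))  # sent as UTF-8
    print(decode_query_value("caf%E9", "iso-8859-1"))     # sent as ISO-8859-1
```

The heuristic is not watertight, since some byte sequences are valid in 
both encodings, but it lets a server recover values sent either way. A 
value that arrives with "?" in place of the original character offers no 
such recovery path.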

>> On the other hand, documenting something that is clearly broken seems to 
>> be the wrong approach to me, in particular as we have proof that there 
>> currently isn't any reliable interoperability for this edge case.
> 
> This is error handling (this can't happen for conforming documents), so 
> I'm surprised that you have an opinion as to what should happen. :-)

I care because I'd like to see documents using non-ASCII characters in 
query parts become compliant no matter what encoding they are in.

> Turning these characters into question marks seems better than the 
> alternatives to me. It also matches what IE does, which is usually a good 
> sign too.

It doesn't match what Firefox does, which is usually a bad sign too.

>> It would be interesting to know how many pages out there contain 
>> characters in query parts of links that aren't part of the document 
>> encoding. Only these would be broken if the more sane FF approach were 
>> used (and these pages may *already* be broken in FF as of today).
> 
> Such pages are presumably broken in all browsers today.

In which case I'd argue that we shouldn't specify something that loses 
data.

> ...
>> A specification should be clearly written. Redefining terminology that 
>> is already well-known by the reader is confusing and counterproductive.
> 
> I agree. However, in this case I don't believe "URL" as per RFC3986 is 
> "well known". I think "URL" as per HTML5 is what it is most commonly 
> assumed to mean.

I believe many developers rely on the RFC definitions.

> ...
>> When somebody sees "URL" they think "Hey, I already know what a URL is." 
>> Whenever *I* see "URL" I think "Why can't I use an IRI here?" When 
>> somebody sees "HRL" they will think "WTF is a HRL?" which will motivate 
>> them to read the definition.
> 
> I think you (and others on the uri@w3.org list) are the exception here 
> rather than the rule. Even I didn't know that RFC3986 defined "URL" until 
> recently, and I read RFCs for a living.

Whether or not RFC 3986 defines "URL" is really not the point. If it 
didn't, another, earlier RFC would.

> ...
> On Sat, 28 Jun 2008, Julian Reschke wrote:
>> That's true, but those people certainly are not the intended audience 
>> for this spec, right?
> 
> They are certainly a big part of the intended audience.

Sorry? People who think '"URL" simply means "the internet address you 
can type in a web browser"' are the intended audience for this spec? If 
you really think that, I recommend letting those people try to read and 
understand it.

> ...
> On Sun, 29 Jun 2008, Julian Reschke wrote:
>>> If valid HTML5 URLs and valid IRIs are equivalent, and invalid HTML5 
>>> URLs and invalid IRIs are indistinguishable, then what's the problem?
>> Valid HTML5 URLs are IRIs.
>>
>> Invalid HTML5 URLs get special treatment in the spec (note I'm not 
>> arguing against that treatment). The confusion comes from the fact that 
>> when the spec says "URL" it really means any URL, not only valid ones.
> 
> I don't understand why that would be confusing, except for theoreticians 
> who consider invalid documents to not be relevant. I don't think most 
> people think "oh, if it's not valid it's not a URL".

Ian, it was *you* who pointed out that valid URLs are IRIs. I agree that 
for HTML5 it's important to say what to do with invalid URLs, in which 
case there will be confusion between URLs as defined by HTML5, and 
URL/URI/IRI as defined elsewhere.

>> Understood again, but maybe it makes sense to ask the question again, 
>> now that all browser vendors are actually part of the same specification 
>> effort.
> 
> I've asked the question and been given a negative response. I encourage 
> you to follow up on this and see if you can get a better response.

It would be nice to see these kinds of discussions being part of the 
working group process, so that the other WG members can actually see 
what was being proposed, and what the answer was.

BR, Julian
Received on Sunday, 29 June 2008 11:30:05 UTC
