Re: URL work in HTML 5 (semifork)

On 2012/10/16 1:30, Robin Berjon wrote:
> On 15/10/2012 17:49 , Ted Hardie wrote:
>> On Mon, Oct 15, 2012 at 8:07 AM, Robin Berjon <robin@w3.org> wrote:
>>> URLs to non-Web things (e.g. mailto:, smsto:, tel:, etc.) happen in Web
>>> contexts. Libraries written to process those in Web contexts are likely
>>> to be reused elsewhere. There isn't really an option to have some of this
>>> in Web use cases and something else outside of it. If it's used for the
>>> Web, it *will* leak. Probably a lot, and probably fast.

One first question is how much we want it to leak. An example that Anne 
brought up is a URL with a space character in it. It is clear that these 
things exist on the Web, in not too small numbers. On the other hand, 
it's also clear that there are many places (some of them defined by 
specs, some of them just somewhere in scripts and the like) that will 
just 'blow up' when they get a space.

Do we want to make sure that all browsers treat such a space in the same 
way? Most probably yes, and in this case, maybe they already do. Does it 
make sense to write that down? I'd also very much say yes.

Do we want to make sure that all other places that accept URIs or IRIs 
also accept a space and treat it the same? Maybe we would like to do so, 
but is it possible? Quite clearly no (just think of an HTTP request header).

This essentially means that the fork is already here. In some sense, 
that's really bad news. But if we look more closely, the news may not be 
that bad. First, at least for the case with the space in it, we know how 
to convert it to an equivalent without a space: use %20 (except maybe in 
form parts). But we need to make sure that this is written down somewhere.
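As a minimal sketch of that conversion (in Python, with hypothetical 
helper names, and not meant to describe what any browser actually does): 
a literal space becomes %20, while in form-encoded values it conventionally 
becomes "+". The sketch assumes the space is the only character that needs 
fixing.

```python
from urllib.parse import quote_plus


def escape_spaces(url: str) -> str:
    """Replace each literal space with %20; leave everything else untouched.

    Hypothetical helper for illustration only. It does not touch any other
    character that may also be invalid in a URI.
    """
    return url.replace(" ", "%20")


def form_encode_value(value: str) -> str:
    """Encode a single form field value, where a space becomes '+'."""
    return quote_plus(value)


print(escape_spaces("http://example.org/my file.html"))
# prints: http://example.org/my%20file.html
print(form_encode_value("my file"))
# prints: my+file
```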

Second, and that will be more obvious for some more esoteric cases than 
just a space, I think that even among those who agree that such cases 
should be described, and should be handled uniformly by browsers, there 
will be quite some agreement that it's better not to produce such things.

What we end up with is something I'd call a semi-fork, which is a subset 
of "recommended" URIs/IRIs within a larger set of (sometimes, but not 
always) tolerated ones.

We already have this for the XML case, where it's called LEIRIs 
(http://tools.ietf.org/html/draft-ietf-iri-3987bis-12#section-6).

At one point, we tried to do something similar to what Anne is now 
trying to address, but we did not get very far because once one goes 
beyond the simple cases (such as a space), it gets messy quite quickly 
(read: different browsers do different things). Even though there are 
representatives of all major browser vendors subscribed to the IRI WG 
mailing list, we also didn't get much in terms of contributions or 
feedback (Adam and Anne occasionally were exceptions).

>> I agree. But that argues that an xmpp URI seen in a jabber context
>> and an xmpp URI seen in a web context should be the same;

Syntactically correct xmpp URIs should be the same indeed, and I think 
they currently are.

>> or, to
>> re-iterate, that a fork would be harmful. Changing the URI parsing in
>> web contexts only is likely to be problematic because of leakage.
>> Avoiding that by retaining one way is my personal preference for the
>> way forward. But if those working on web-specific specs do not agree
>> and choose to fork, then we *must* mark the difference between the
>> contexts, or the results will be even worse.
>
> I think that we're in ruthlessly violent agreement here :)
>
> At this point we have to look at what status Anne's work could be
> published under. It doesn't have to be a fork, it could simply be
> published as The One True Way to parse URLs (after reviews, etc.
> obviously). Is that something that could be acceptable?

I think it can easily be the One True Way to parse URLs in Web Browsers. 
Given some of the current differences between browsers, even that may be 
tough, but I very much hope that Anne can be successful.

I think that in a way similar to how the HTML5 spec currently 
distinguishes between an authoring version and a parsing version, Anne's 
document can be the parsing version for Web browsers, and RFC 3986, and 
3987bis, can be the authoring version(s).

Of course, that's not a strict parallel. As an example, Anne plans to 
clearly document/spec how URL equivalence works in JavaScript. For 
everybody who uses JavaScript, this will clearly be a good thing. 
However, as http://tools.ietf.org/html/rfc3986#section-6,
http://tools.ietf.org/html/rfc3987#section-5, and 
http://tools.ietf.org/html/draft-ietf-iri-comparison-01 should make 
quite clear, how to compare URIs/IRIs/URLs depends very much on the 
application. At one end, a spider will take as many shortcuts as 
possible, while at the other end, XML namespaces and RDF will do 
codepoint-by-codepoint comparison, and there is clearly some value in 
documenting that. (Also, an extended JavaScript library may provide 
quite a few variants to deal with these application needs.)
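To illustrate the two ends of that range, here is a rough sketch in 
Python. The function names are made up for this example, and the 
normalization shown covers only a small part of what RFC 3986 section 6 
describes (case of scheme and host, case of percent-escape hex digits). It 
also assumes the URI has no userinfo part, since lowercasing the whole 
authority would otherwise be wrong.

```python
import re
from urllib.parse import urlsplit, urlunsplit

_PCT_ESCAPE = re.compile(r"%[0-9a-fA-F]{2}")


def same_codepoints(a: str, b: str) -> bool:
    """Strict comparison, as used for XML namespace names or RDF."""
    return a == b


def normalize(uri: str) -> str:
    """Apply a few syntax-based normalizations (illustrative, not complete).

    Assumption: the authority contains no userinfo, so it is safe to
    lowercase it as a whole.
    """
    parts = urlsplit(uri)

    def upper_escapes(s: str) -> str:
        # %2f and %2F are equivalent; use uppercase hex digits throughout.
        return _PCT_ESCAPE.sub(lambda m: m.group(0).upper(), s)

    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        upper_escapes(parts.path),
        upper_escapes(parts.query),
        upper_escapes(parts.fragment),
    ))


def same_after_normalization(a: str, b: str) -> bool:
    """Looser comparison, closer to what a spider might do."""
    return normalize(a) == normalize(b)


a = "HTTP://Example.ORG/a%2fb"
b = "http://example.org/a%2Fb"
print(same_codepoints(a, b))           # prints: False
print(same_after_normalization(a, b))  # prints: True
```

Which of the two answers is the "right" one depends on the application, 
which is exactly the point.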


Last but not least, I would like to mention that if there's anything 
that we can reasonably do to make the gap in the semifork narrower, then 
we should give it a try. Two examples: First, RFC 3987 was quite strict 
about character normalization in some circumstances. It has turned out 
that browsers did it differently, so we changed the spec. Also, we had 
to find out that query parts don't get converted using UTF-8 as often as 
we would like. So we also adapted the spec, even though that's still 
under discussion. If there are other cases that we *can* address, please 
tell us. On the other hand, I'd hope that with the work that Anne does, 
he also tries to narrow the gap where possible, e.g. by choosing a 
solution closer to RFC 3986/3987bis where browsers disagree.


Regards,   Martin.

Received on Tuesday, 16 October 2012 05:37:33 UTC