Re: Non-hierarchical base URLs (was Re: draft-abarth-url-01 uploaded) from Maciej Stachowiak on 2011-04-25 (public-iri@w3.org from April 2011)

From: Maciej Stachowiak <mjs@apple.com>
Date: Mon, 25 Apr 2011 01:14:59 -0700
To: Adam Barth <ietf@adambarth.com>
Cc: Julian Reschke <julian.reschke@gmx.de>, public-iri@w3.org
Message-id: <9BCD70D7-E390-45D2-B939-4A887FAF37B2@apple.com>

On Apr 25, 2011, at 12:50 AM, Adam Barth wrote:

> On Mon, Apr 25, 2011 at 12:27 AM, Julian Reschke <julian.reschke@gmx.de> wrote:
>> On 24.04.2011 20:10, Adam Barth wrote:
>>> Finding the scheme aborts the "finding the scheme" algorithm (hence
>>> the separate section and the phrase "these steps") and reports that
>>> the URL is invalid when there is no scheme.  The algorithm for
>>> resolving a relative URL then continues down this branch "if
>>> relative-url is an invalid URL...".
>> 
>> Got it, thanks.
>> 
>> So, 4.1 says:
>> 
>>>   TODO: If base-url's scheme is not hierarchical, we can't resolve as a
>>>   relative URL.  We'll probably want to return an invalid URL.  Check
>>>   what happens when resolving an empty string as a relative URL with a
>>>   non-hierarchical base.
>> 
>> If you look at RFC 3986 you will see that there's no problem like that. Both
>> URIs and relative references are parsed into components, and the resolution
>> algorithm doesn't care where they came from, and has no extra knowledge of
>> "hierarchical" or specific schemes.
> 
> I don't believe you can correctly account for the behavior of existing
> browsers without classifying schemes into at least two categories.
> For the purposes of discussion, let's call those two categories
> hierarchical and non-hierarchical (but of course we could use whatever
> names we like).

I assume Adam knows this, but for everyone else's sake: there are at least two other categories with distinctive parsing behavior, namely file: and mailto:.

> 
> What does the following HTML alert?
> 
> <base href="data://foo/bar?baz#qux">
> <a href="taco.html">hello</a>
> <script>
> alert(document.getElementsByTagName('a')[0].href)
> </script>
> 
> What about the following?
> 
> <base href="http://foo/bar?baz#qux">
> <a href="taco.html">hello</a>
> <script>
> alert(document.getElementsByTagName('a')[0].href)
> </script>
> 
> The facts are that URL handling in browsers is not uniform across
> schemes.  We might feel happy or sad about that, but that's how things
> are.  If we're going to write specs that tell the truth, then we need
> to acknowledge these infelicities rather than sticking our heads in
> the sand and pretending they aren't the case.

One possible degree of freedom is what to do with URL schemes that are not currently in use in the browser context. It may be possible to limit scheme-specific behavior to a fixed set of currently-known schemes, and let future schemes parse according to the generic syntax (i.e. presence of // after the scheme would determine whether there is an authority, not whether the scheme is known and whitelisted as "hierarchical"). The advantage would be that when future network protocols are introduced to browsers, parsing behavior of existing URLs won't suddenly change, and won't become inconsistent between browsers depending on whether they know the protocol already.

But today's browsers clearly process http: and data: URLs differently, regardless of whether a // follows the colon. Recording this knowledge is a good thing.
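
As a rough illustration of the degree of freedom described above, here is a minimal sketch, not anything taken from the draft or from a browser. It assumes only two categories (it ignores file: and mailto:), and the function name `is_hierarchical` and the contents of both scheme tables are hypothetical, chosen just to show the shape of the rule: schemes in a fixed, frozen list get their legacy classification no matter what follows the colon, while any other scheme is classified purely by whether // follows the colon.

```python
# Hypothetical, frozen tables of schemes with legacy browser behavior.
# The membership here is illustrative only, not taken from any spec.
KNOWN_HIERARCHICAL = frozenset({"http", "https", "ftp"})
KNOWN_NON_HIERARCHICAL = frozenset({"data"})


def is_hierarchical(url: str) -> bool:
    """Decide whether an absolute URL is treated as hierarchical.

    Known schemes keep their fixed classification regardless of "//".
    Unknown (future) schemes follow the generic syntax: "//" after the
    colon means an authority is present, so the URL is hierarchical.
    """
    scheme, colon, rest = url.partition(":")
    if not colon:
        raise ValueError("no scheme found: %r" % url)
    scheme = scheme.lower()  # scheme names are case-insensitive
    if scheme in KNOWN_HIERARCHICAL:
        return True
    if scheme in KNOWN_NON_HIERARCHICAL:
        return False
    # Not in either fixed table: let the syntax decide.
    return rest.startswith("//")


if __name__ == "__main__":
    for candidate in (
        "http://foo/bar?baz#qux",
        "data://foo/bar?baz#qux",
        "newproto://host/path",
        "newproto:opaque-part",
    ):
        print(candidate, is_hierarchical(candidate))
```

Because the tables never grow, a browser that later learns a new protocol would classify its URLs the same way as a browser that has never heard of it, which is the consistency property the proposal is after.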

Regards,
Maciej

Received on Monday, 25 April 2011 08:15:29 UTC