Re: Non-hierarchical base URLs (was Re: draft-abarth-url-01 uploaded) from Adam Barth on 2011-05-03 (public-iri@w3.org from May 2011)

From: Adam Barth <ietf@adambarth.com>
Date: Mon, 2 May 2011 17:42:16 -0700
To: "Roy T. Fielding" <fielding@gbiv.com>
Cc: Maciej Stachowiak <mjs@apple.com>, public-iri@w3.org
Message-ID: <BANLkTi=mtFPS7Mh48Tc2NfqkxqKhX2UhFA@mail.gmail.com>

On Mon, May 2, 2011 at 4:33 PM, Roy T. Fielding <fielding@gbiv.com> wrote:
> On Apr 28, 2011, at 12:29 AM, Maciej Stachowiak wrote:
>> On Apr 27, 2011, at 10:12 PM, Roy T. Fielding wrote:
>>
>>> As you well know, what HTML5 needs is a definition for parsing
>>> arbitrary attribute values in document encoding.  Those attribute
>>> values are not URLs.  They aren't even URI references.  They are
>>> one or more space-separated or space-ignoring strings in an HTML
>>> attribute encoding, and each reference needs to be extracted and
>>> transcoded before the definitions in 3986 are even applicable.
>>
>> It is fine to call the types of resource identifiers that appear in HTML and other parts of the Web platform (CSS, XHR, SVG, etc.) something other than "URL" or "URI". The name does not really matter for interoperability.
>
> Technically, yes, but from a social perspective we have seen a
> lot of confusion caused by descriptions of URL, URI, or anyURI
> being the documented value range of attributes or data entry
> dialogs.  That was actually true in some distant past, but
> browsers stopped rejecting invalid references a long time ago,
> for good reasons, and usually do some form of pct-encoding or
> truncation instead.  We should therefore stop referring to the
> input as a uniform identifier when it is, in fact, an arbitrary
> string.  We can then unambiguously refer to the output of the
> parsing, transcoding, and recombination algorithm as a URI or
> URL as defined by RFC3986.  That is why I use the term reference
> for the value as found in the attribute/dialog.
>
>>> However, for the subset of possible references that do happen
>>> to match what are called valid URI references by RFC3986,
>>> we have already tested consensus and deployed many implementations
>>> that conform exactly to the results given in RFC3986.
>>
>> If these references are something other than URIs, and must be transcoded, why is it important that the subset that happens to look syntactically like a valid URI must be processed without that transcoding step? This implies that the transcoding must be the identity encoding in some cases. Where does that assumption come from?
>
> Authors have been using plain old ASCII references to URIs for
> longer than the Web has been documented.  We expect them to
> still work.  Likewise for references that are in the document
> encoding but only use the subset of characters that are found
> in ASCII.  URIs are defined in terms of characters, not octets,
> so the transcoding I am referring to is the removal of whitespace,
> pct-encoding of non-unreserved characters, etc.  A reference that
> is already in URI form does not need to be transcoded.

You're missing the constraint that browser vendors aren't going to
change their implementations to align with this dream.  Our choice is
between having the specification reflect that reality or having the
spec tell a lie.

Adam

Received on Tuesday, 3 May 2011 00:43:15 UTC
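
For readers following the thread, the transcoding step Roy describes in the quoted text (removal of whitespace, pct-encoding of characters that may not appear in a URI, and no change for a reference already in URI form) could be sketched roughly as below. This is only an illustration under stated assumptions, not the algorithm of RFC 3986, HTML5, or any browser: the function name `transcode_reference` is hypothetical, UTF-8 is assumed for the percent-encoding, "removal of whitespace" is read as trimming the ends and dropping tabs and line breaks, and existing "%" characters are passed through without validation.

```python
import re
from urllib.parse import quote

# RFC 3986 reserved characters (gen-delims and sub-delims), plus "%" so that
# existing percent-escapes pass through, plus "~" from the unreserved set.
# Letters, digits, "-", "." and "_" are never escaped by urllib.parse.quote.
_KEEP = ":/?#[]@!$&'()*+,;=%~"


def transcode_reference(raw: str) -> str:
    """Hypothetical helper: turn an attribute value into URI characters.

    Assumptions (not from the thread): UTF-8 percent-encoding, leading and
    trailing whitespace trimmed, embedded tabs and line breaks dropped,
    interior spaces percent-encoded.
    """
    ref = raw.strip()
    ref = re.sub(r"[\t\r\n]", "", ref)
    # Anything outside the URI character repertoire becomes %XX octets.
    return quote(ref, safe=_KEEP, encoding="utf-8")


if __name__ == "__main__":
    for value in (
        "http://example.com/a?b=c#d",   # already in URI form
        "  http://example.com/a b\n",   # stray whitespace
        "http://example.com/\u00e9",    # non-ASCII character
    ):
        print(repr(value), "->", transcode_reference(value))
```

A reference made up only of URI characters comes back unchanged, which is the identity case Maciej asks about; whether deployed browsers behave this way is exactly the point in dispute in the thread.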