Subject: Re: Non-hierarchical base URLs (was Re: draft-abarth-url-01 uploaded)

From: Roy T. Fielding <fielding@gbiv.com>
Date: Mon, 2 May 2011 16:33:00 -0700
To: Maciej Stachowiak <mjs@apple.com>
Cc: Adam Barth <ietf@adambarth.com>, public-iri@w3.org
Message-Id: <42B7FB82-93A9-4D07-A0E0-9A4C842C28E7@gbiv.com>

On Apr 28, 2011, at 12:29 AM, Maciej Stachowiak wrote:
> On Apr 27, 2011, at 10:12 PM, Roy T. Fielding wrote:
> 
>> As you well know, what HTML5 needs is a definition for parsing
>> arbitrary attribute values in document encoding.  Those attribute
>> values are not URLs.  They aren't even URI references.  They are
>> one or more space-separated or space-ignoring strings in an HTML
>> attribute encoding, and each reference needs to be extracted and
>> transcoded before the definitions in 3986 are even applicable.
> 
> It is fine to call the types of resource identifiers that appear in HTML and other parts of the Web platform (CSS, XHR, SVG, etc.) something other than "URL" or "URI". The name does not really matter for interoperability.

Technically, yes, but from a social perspective we have seen a
lot of confusion caused by descriptions of URL, URI, or anyURI
being the documented value range of attributes or data entry
dialogs.  Those descriptions were accurate in some distant
past, but browsers stopped rejecting invalid references a long
time ago, for good reasons, and now usually apply some form of
pct-encoding or truncation instead.  We should therefore stop
referring to the
input as a uniform identifier when it is, in fact, an arbitrary
string.  We can then unambiguously refer to the output of the
parsing, transcoding, and recombination algorithm as a URI or
URL as defined by RFC3986.  That is why I use the term reference
for the value as found in the attribute/dialog.

>> However, for the subset of possible references that do happen
>> to match what are called valid URI references by RFC3986, then
>> we have already tested consensus and deployed many implementations
>> that conform exactly to the results given in RFC3986.  
> 
> If these references are something other than URIs, and must be transcoded, why is it important that the subset that happens to look syntactically like a valid URI must be processed without that transcoding step? This implies that the transcoding must be the identity encoding in some cases. Where does that assumption come from?

Authors have been using plain old ASCII references to URIs for
longer than the Web has been documented.  We expect them to
still work.  Likewise for references that are in the document
encoding but only use the subset of characters that are found
in ASCII.  URIs are defined in terms of characters, not octets,
so the transcoding I am referring to is the removal of whitespace,
pct-encoding of characters that are not allowed in a URI, etc.  A
reference that is already in URI form does not need to be transcoded.
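
To illustrate the point about identity transcoding, here is a minimal
sketch in Python.  It is not taken from RFC 3986 or from any browser;
the function name transcode_reference is hypothetical, and the sketch
assumes that reserved delimiters and existing pct-encoded triplets are
left alone, that tabs and newlines are dropped, and that anything else
is pct-encoded from its UTF-8 octets.  Under those assumptions, a
reference that is already in URI form comes back unchanged.

```python
import re

# Characters that RFC 3986 allows to appear literally in a URI reference.
UNRESERVED = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)
RESERVED = ":/?#[]@!$&'()*+,;="
ALLOWED = set(UNRESERVED + RESERVED)

# A "%" is kept as-is only when it starts a valid pct-encoded triplet.
PCT_TRIPLET = re.compile(r"%[0-9A-Fa-f]{2}")


def transcode_reference(raw: str) -> str:
    """Turn an arbitrary attribute string into URI-reference characters.

    Hypothetical helper: the whitespace rules and the choice of UTF-8
    are assumptions made for this sketch, not a specified algorithm.
    """
    # Remove surrounding whitespace, then embedded tabs and newlines.
    ref = raw.strip(" \t\n\r\f")
    ref = re.sub(r"[\t\n\r]", "", ref)

    out = []
    i = 0
    while i < len(ref):
        ch = ref[i]
        if ch == "%" and PCT_TRIPLET.match(ref, i):
            # Already pct-encoded: copy the triplet through unchanged.
            out.append(ref[i:i + 3])
            i += 3
            continue
        if ch in ALLOWED:
            out.append(ch)
        else:
            # Not allowed in a URI: pct-encode each UTF-8 octet.
            out.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
        i += 1
    return "".join(out)
```

With this sketch, "http://example.com/a/b?x=1#frag" is returned as-is,
while "  http://example.com/a b" becomes "http://example.com/a%20b".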

....Roy

Received on Monday, 2 May 2011 23:33:25 UTC