Re: "Web addresses in HTML 5" for review (ISSUE-56 urls-webarch) from Roy T. Fielding on 2009-03-19 (public-html@w3.org from March 2009)

From: Roy T. Fielding <fielding@gbiv.com>
Date: Wed, 18 Mar 2009 18:21:01 -0700
To: Dan Connolly <connolly@w3.org>
Cc: public-html@w3.org
Message-Id: <B40F388A-3CF4-498C-BA18-5C85936F9292@gbiv.com>
On Mar 18, 2009, at 6:56 AM, Dan Connolly wrote:

> Attached find a draft for ACTION-68
... (converted to text)

>                             Web addresses in HTML 5

It would be better called "Referencing a URI in HTML5"

Seriously, I consider that to be a show-stopper.  There are these
things we have variously called UDIs, Universal RIs, URLs, URNs,
and now Uniform Resource Identifiers. Those are all Web addresses.
They are defined by an Internet Standard.  We spent over 10 years getting
the entire world to agree on what they are and how they are communicated
across the Internet.  That work is done!  Starting over just because
HTML5 can't get Web terminology right is ridiculous.

Web addresses are not defined by HTML5 any more than the physical
address of your house is defined by an envelope sent via USPS mail.
What appears on the envelope is a reference to the physical address
which is interpreted by various postal carriers in their attempt
to route that mail to your postal box.  Some references are less
ambiguous than others.  Some can be parsed by machines.  Some
references can only be understood by the last carrier on the route.

What HTML5 defines is the process of getting from the reference
to the URI within the context of the generic syntax standard.
There is absolutely no reason to call that a new definition for
Web Addresses, URLs, or anything else like it.  The URI standard
applies to all Web technology and formats, not just HTML5 as it
might be used by future browsers.  Calling the input to that
process a "Web address" is needlessly confusing when the same
term is already commonly understood to be the output of that process.

>    This specification defines the term Web address, and defines various

Web address is an already defined term.  This is the same objection
I made to calling these references URLs.  Just call them references.

What HTML documents contain are various attributes that have values
which reference one or more URIs through the use of a sequence of
characters in the document encoding.  In order to get from the CDATA
to the referenced URI, the parser must extract the sequence of
characters, transcode it to UTF-8, decompose the string into the
generic syntax components as defined by RFC 3986 (note that this works
even if the reference is invalid or an IRI), IDNA-encode (for host)
or percent-encode each of those components for any octets that are
not allowed in the URI syntax, and then resolve it relative to the
base URI as defined in RFC 3986.
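
The steps in the paragraph above can be sketched in Python.  This is
only an illustration under stated assumptions, not what any browser
implements.  The names URI_SPLIT, split_reference, encode_authority and
reference_to_uri are made up for this example, the authority is assumed
to be a host with an optional port (no userinfo, no IPv6 literal), and
Python's built-in "idna" codec stands in for whatever IDNA processing a
real user agent would apply.  The regular expression is the one given
in RFC 3986, Appendix B.

```python
import re
from urllib.parse import quote, urljoin

# Component-splitting expression from RFC 3986, Appendix B.
URI_SPLIT = re.compile(
    r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?', re.DOTALL)

# Characters left alone when percent-encoding. "%" is kept so that
# escapes already present in the reference are not encoded twice.
PATH_SAFE = "/:@!$&'()*+,;=%"
QUERY_SAFE = PATH_SAFE + "?"


def split_reference(ref):
    """Decompose a string into the five generic syntax components."""
    m = URI_SPLIT.match(ref)
    return m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)


def encode_authority(authority):
    """IDNA-encode the host (assumption: authority is host[:port] only)."""
    host, colon, port = authority.partition(":")
    if host:
        host = host.encode("idna").decode("ascii")
    return host + colon + port


def reference_to_uri(raw, doc_encoding, base_uri):
    """Turn attribute bytes in the document encoding into a resolved URI."""
    text = raw.decode(doc_encoding)  # quote() below emits UTF-8 octets
    scheme, authority, path, query, fragment = split_reference(text)

    # Recompose as in RFC 3986, Section 5.3, encoding each component.
    ref = ""
    if scheme is not None:
        ref += scheme + ":"
    if authority is not None:
        ref += "//" + encode_authority(authority)
    ref += quote(path, safe=PATH_SAFE)
    if query is not None:
        ref += "?" + quote(query, safe=QUERY_SAFE)
    if fragment is not None:
        ref += "#" + quote(fragment, safe=QUERY_SAFE)

    # Resolve against the base URI (RFC 3986, Section 5.2).
    return urljoin(base_uri, ref)


raw = "/r\u00e9sum\u00e9?q=caf\u00e9".encode("iso-8859-1")
print(reference_to_uri(raw, "iso-8859-1", "http://example.org/a/b"))
# http://example.org/r%C3%A9sum%C3%A9?q=caf%C3%A9
```

Note that nothing in the sketch redefines the generic syntax or the
resolution algorithm; both are taken as given from RFC 3986.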

In fact, I would argue that the above paragraph is sufficient to
define these strings for HTML5, but I have no doubt that others
will want to explain each of those parts in detail.  That's fine,
but only as long as you don't repeat or redefine the parts that
are already defined by 3986.  There is no need to.

>    algorithms for dealing with Web addresses, because for historical reasons
>    the rules defined by the URI and IRI specifications are not a complete
>    description of what HTML user agents need to implement to be compatible
>    with Web content.

Bah! The URI and IRI specifications only define the parts that
are common to all Web technology.  There has never been any reason
for them to define HTML5 parsing.

> 1 Terminology
>
>    A Web address is a string used to identify a resource.
>
>    The term "Web address" in this specification is used to include not only
>    Uniform Resource Identifiers (URIs) as they are defined by RFC 3986 and
>    Internationalized Resource Identifiers (IRIs) as they are defined by RFC
>    3987, but also other strings of characters which can be used to identify
>    Web resources when processed appropriately.
>
>    A Web address is a valid Web address if at least one of the following
>    conditions holds:
>
>      * The Web address is a valid URI reference (i.e. it matches the grammar
>        for <URI-reference> given in RFC 3986).
>
>      * The Web address is a valid IRI reference (i.e. it matches the grammar
>        for <IRI-reference> given in RFC 3987), and it has no query
>        component.
>
>      * The Web address is a valid IRI reference and its query component
>        contains no unescaped non-ASCII characters [RFC3987].
>
>      * The Web address is a valid IRI reference and the character encoding of
>        the Web address's Document is UTF-8 or UTF-16 [RFC3987].
>
>    A Web address has an associated URL character encoding, determined as
>    follows:
>
>    If the Web address came from a script (e.g. as an argument to a method)
>            The Web address character encoding is the script's character
>            encoding.
>
>    If the Web address came from a DOM node (e.g. from an element)
>            The node has a Document, and the URL character encoding is the
>            document's character encoding.
>
>    If the Web address had a character encoding defined when the Web address
>    was created or defined
>            The Web address character encoding is as defined.

s/Web address/reference to a URI/g;

> 2 Parsing Web addresses

... sorry, this entire section is disconnected from reality.
It doesn't match any of the known implementations and directly
contradicts the standard.

> 3 Resolving Web addresses

... this section should be split into "establishing the base URI
for a given reference" and "resolving URI components."

Browser-specific error handling should be applied to each component
of the base URI and reference after they have been parsed into
components according to the algorithm in RFC 3986.  There is no need
to repeat the URI syntax and resolution algorithm here.  The algorithm
in 3986 is specifically designed to accept any string, which means
that it is impossible for it to result in an "error" -- the bad
characters are just placed in the corresponding component and
the resolution process (the algorithm that interprets the meaning
of those components as forming a scheme-specific address) is
responsible for identifying the intra-component errors and
intra-component encoding requirements.  That is necessary because
the contents of a component are only an error when interpreted
according to the URI scheme's semantics.
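
To illustrate that the generic parse accepts any string: the regular
expression in RFC 3986, Appendix B consists only of optional groups, so
it matches every input and simply files whatever characters it finds
under a component.  A small Python sketch (the names APPENDIX_B and
split_any are hypothetical, chosen for this example):

```python
import re

# Regular expression from RFC 3986, Appendix B (DOTALL so that even
# embedded newlines end up in a component).
APPENDIX_B = re.compile(
    r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?', re.DOTALL)


def split_any(s):
    """Return (scheme, authority, path, query, fragment) for any string."""
    m = APPENDIX_B.match(s)  # never None: every group is optional
    return m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)


# Spaces and angle brackets are not allowed in a URI, yet the parse
# succeeds and the bad characters just sit inside their components.
print(split_any("http://exa mple.com/a b?x=<y>#frag ment"))
# ('http', 'exa mple.com', '/a b', 'x=<y>', 'frag ment')
print(split_any(""))
# (None, None, '', None, None)
```

Whether "exa mple.com" is an error is then a question for the handling
of the authority component under the http scheme, not for the generic
parser.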

Likewise, if the base URI is a valid URI and the reference
components are valid components in URI-encoded form, then the
resolution process (combining reference and base URI) will
always result in a valid URI.
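
A quick way to see this is Python's urllib.parse.urljoin, fed the base
URI and a few of the references from the examples in RFC 3986,
Section 5.4.  The looks_like_uri helper is hypothetical and only checks
characters (a scheme, a colon, then nothing outside the set the URI
syntax permits); it is not a full validator.

```python
import re
from urllib.parse import urljoin

BASE = "http://a/b/c/d;p?q"  # base URI used in RFC 3986, Section 5.4

# Rough, character-level check only (an assumption of this sketch).
URI_CHARS = re.compile(
    r"^[A-Za-z][A-Za-z0-9+.-]*:[A-Za-z0-9._~:/?#\[\]@!$&'()*+,;=%-]*$")


def looks_like_uri(s):
    """True if s has a scheme and uses only URI-permitted characters."""
    return URI_CHARS.match(s) is not None


for ref in ["g", "../g", "//g", "?y", "g?y#s"]:
    target = urljoin(BASE, ref)
    print(ref, "->", target, looks_like_uri(target))
```

Each reference here is already in URI-encoded form, so combining it
with the valid base needs no further error handling.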


Cheers,

Roy T. Fielding                            <http://roy.gbiv.com/>
Chief Scientist, Day Software              <http://www.day.com/>
Received on Thursday, 19 March 2009 01:21:30 UTC