Re: CfC: Request transition of HTML5 to Candidate Recommendation from Sam Ruby on 2014-09-30 (public-html-admin@w3.org from September 2014)

From: Sam Ruby <rubys@intertwingly.net>
Date: Mon, 29 Sep 2014 21:32:13 -0400
To: "Roy T. Fielding" <fielding@gbiv.com>
CC: HTML WG <public-html-admin@w3.org>
Message-ID: <542A081D.2060207@intertwingly.net>

I've taken a look at the recent snapshots of the URL specification.

Scheme no longer contains a colon, "authority-based URL" and 
"hierarchical URL" are no longer defined, and I don't see a section 
entitled "URL manipulation and creation".

Would you agree that what remains are nomenclature differences and the 
fact that this specification doesn't match observed behavior of existing 
user agents?

- Sam Ruby

On 11/25/2012 06:18 PM, Roy T. Fielding wrote:
> On Nov 14, 2012, at 2:15 PM, Sam Ruby wrote:
>
>> In accordance with both the W3C process's requirement to record the group's decision to request advancement[1], and with the steps identified in the "Plan 2014" CfC[2], this is a Call for Consensus (CfC) to request transition to CR for the following document:
>>
>> http://htmlwg.org/cr/html/index.html
>>
>> Silence will be taken to mean there is no objection, but positive
>> responses are encouraged. If there are no objections by Monday,
>> November 26th, this resolution will carry.
>>
>> Considerations to note:
>>
>> - A request to advance indicates that the Working Group believes the specification is stable and appropriate for implementation.
>>
>> - The specification MAY still change based on implementation experience.
>>
>> - Sam Ruby, on behalf of the W3C co-chairs
>>
>> [1] http://www.w3.org/2005/10/Process-20051014/tr.html#transition-reqs
>> [2] http://lists.w3.org/Archives/Public/public-html/2012Oct/0026.html
>
>
> This specification continues to use terminology and definitions
> that are arbitrarily different from the other specifications of
> Web architecture, resulting in needless argumentation in support
> of willful violations that are really just a failure to use the
> right terms at the right times.
>
>    URL       --> reference
>    resource  --> representation
>    encoding  --> charset (or character encoding scheme)
>
> The section on URLs
>
>    http://htmlwg.org/cr/html/urls.html
>
> is particularly egregious since it redefines URL to be a reference
> and then modifies the ABNF of RFC3986 in order to parse and
> resolve a URL (== reference) to an absolute URL, reversing the
> incorrect terms in order to invoke the other specification's
> algorithms.  AFAICT, the rest of the HTML5 specification does not
> need to use the term URL except as part of a defined phrase, such
> as "valid URL potentially surrounded by spaces".
>
> In fact, the places where the defined phrase is used do allow
> any string as input (a reference) and do not perform any sort
> of validation on that input.  They also don't treat arbitrary
> whitespace characters verbatim, as described in the algorithm.
> The places where validation is relevant are the DOM setters,
> which define their own conversion algorithms specific to each
> component.
>
> Where the 3986 parsed components are used, technical errors
> have been introduced by further unnecessary definitions. E.g.,
> sec 2.6.4:
>
>    An absolute URL is a hierarchical URL if, when resolved and
>    then parsed, there is a character immediately after the
>    <scheme> component and it is a "/" (U+002F) character.
>
>    An absolute URL is an authority-based URL if, when resolved
>    and then parsed, there are two characters immediately after
>    the <scheme> component and they are both "//" (U+002F)
>    characters.
>
> In both cases, the character immediately after the <scheme>
> component is a colon (":"), because the colon is a separator
> and not part of the component.  This is unlike the DOM
> attribute "protocol", whose getter (due to some ancient bug)
> appends the ":" to the scheme.
>
> It would be correct to say that an absolute URL is
> authority-based if, after parsing, the authority component
> is defined.  Likewise, a URL is hierarchical if it is
> authority-based or the pathname begins with "/".  However,
> I doubt that the specification needs either of these terms;
> if they are used somewhere, then define them where they are used.
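
The corrected definitions above can be sketched with the regular expression from RFC 3986, Appendix B. This is only an illustration: the function names (split_components, is_authority_based, is_hierarchical, protocol_getter) are hypothetical and come from neither specification, and protocol_getter merely mimics the colon-appending behavior described for the DOM "protocol" attribute.

```python
import re

# Regular expression from RFC 3986, Appendix B.
RFC3986_APPENDIX_B = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_components(url):
    """Split a URL into its five RFC 3986 components (None = undefined)."""
    m = RFC3986_APPENDIX_B.match(url)
    return {
        "scheme": m.group(2),     # never includes the ":" separator
        "authority": m.group(4),  # "" (empty) is still "defined"
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


def is_authority_based(url):
    """Authority-based: the authority component is defined after parsing."""
    return split_components(url)["authority"] is not None


def is_hierarchical(url):
    """Hierarchical: authority-based, or the path begins with "/"."""
    parts = split_components(url)
    return parts["authority"] is not None or parts["path"].startswith("/")


def protocol_getter(url):
    """Hypothetical model of the DOM getter: the scheme plus a trailing ":"."""
    return split_components(url)["scheme"] + ":"
```

For example, split_components("http://a/b")["scheme"] is "http" while protocol_getter("http://a/b") is "http:"; "mailto:x@y" is neither authority-based nor hierarchical, and "file:///etc" is both because its authority is defined but empty.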
>
> And in 2.6.7
>
>     o.protocol [ = value ]
>
>     Returns the current scheme of the underlying URL.
>
>     Can be set, to change the underlying URL's scheme.
>
> is likewise incorrect because it returns the URL scheme and ":".
> I am not sure what happens when it is set, with or without a ":".
>
> Sec 2.6.5 apparently defines an incorrect fragment-escape
> algorithm within a section entitled "URL manipulation and
> creation".  I am not sure how that is supposed to be used.
> If it is just for fragments, then the section title should
> be corrected and the algorithm changed to avoid double-encoding
> an existing pct-encoded sequence.  If it is for any URL component,
> then delete the section because it is hopelessly wrong.
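
The double-encoding concern can be illustrated with a small sketch. It does not reproduce the sec 2.6.5 algorithm. It only shows one way to escape a fragment while leaving existing pct-encoded triplets untouched. The name escape_fragment and the FRAGMENT_SAFE set (the RFC 3986 sub-delims plus ":", "@", "/" and "?") are assumptions made for the example.

```python
import re
from urllib.parse import quote

# Characters RFC 3986 allows unescaped in a fragment, besides the
# unreserved set (which quote() never escapes).
FRAGMENT_SAFE = "!$&'()*+,;=:@/?"

# An existing pct-encoded triplet such as "%20".
_PCT_TRIPLET = re.compile(r"(%[0-9A-Fa-f]{2})")


def escape_fragment(raw):
    """Percent-encode a fragment without re-encoding existing %HH triplets."""
    pieces = _PCT_TRIPLET.split(raw)  # the capturing group keeps the triplets
    return "".join(
        piece if _PCT_TRIPLET.fullmatch(piece) else quote(piece, safe=FRAGMENT_SAFE)
        for piece in pieces
    )
```

A naive quote("a%20b", safe=FRAGMENT_SAFE) yields "a%2520b", whereas escape_fragment("a%20b c") yields "a%20b%20c". A stray "%" that does not start a triplet is still encoded as "%25".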
>
> I also find it curious that the spec defines the meaning and
> attributes of the anchor (a) element within the section on Link
> (without an actual section xref).  It should say somewhere that
> the a element's href attribute contains a reference (not a URL)
> and that the a element's href DOM property is the output of
> transcoding and resolving that reference to absolute URL
> form (including a fragment, if any).
>
> Instead, it says in
>
>    http://htmlwg.org/cr/html/the-a-element.html#the-a-element
>
>    The IDL attributes href, target, rel, media, hreflang, and type,
>    must reflect the respective content attributes of the same name.
>
> which is not consistent with how the href DOM attribute is
> implemented in Firefox and Chrome (other UAs not yet tested).
>
> Examples of how embedded whitespace is treated can be seen
> by looking at the href DOM property's result of references
> like:
>
> <PRE>
> <a href=" g "> g (leading and trailing)</a>    = http://a/b/c/g
>
> <a href="g o">g o (embedded)</a>               = http://a/b/c/g%20o
>
> <a href="g
> o">g o (embedded linefeed)</a>                 = http://a/b/c/go
>
> <a href="g
>   o">g o (embedded space linefeed space)</a>    = http://a/b/c/g%20o
>
> <a href="g
>   o">g o (embedded linefeed and space)</a>      = http://a/b/c/g%20o
>
> <a href="g
>    o">g o (embedded linefeed and 2 spaces)</a>  = http://a/b/c/g%20%20o
>
> <a href="g	o">g o (embedded tab)</a>      = http://a/b/c/go
>
> <a href="g
> 	o">g o (embedded linefeed and tab)</a> = http://a/b/c/go
>
> <a href="g
> 	 o">g o (embedded linefeed space tab)</a> = http://a/b/c/g%20o
>
> <a href="g
> 	o">g o (embedded space linefeed tab)</a> = http://a/b/c/g%20o
> </PRE>
>
> In other words, this would suggest that linefeeds and HTAB
> characters are ignored, along with leading and trailing SP
> characters, but each embedded SP is replaced with a %20.
> This was tested on Chrome, so other UAs might differ,
> and I have only been testing <a href>, not the other contexts
> where references are used in HTML.
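
The behavior summarized above can be written down as a short sketch. It encodes only the rule as stated (drop linefeeds and HTABs, strip leading and trailing SP, replace each remaining SP with %20) and is not any user agent's actual algorithm. The base URL "http://a/b/c/d", the function names, and the order of the steps are assumptions chosen to be consistent with the listed results.

```python
from urllib.parse import urljoin

# Assumed base URL; the results listed above are consistent with it.
BASE = "http://a/b/c/d"


def preprocess_href(value):
    """Apply the whitespace rule suggested by the observations."""
    value = value.replace("\n", "").replace("\t", "")  # LF and HTAB are ignored
    value = value.strip(" ")                           # leading/trailing SP dropped
    return value.replace(" ", "%20")                   # each embedded SP -> %20


def resolve_href(value, base=BASE):
    """Resolve the preprocessed reference against the base URL."""
    return urljoin(base, preprocess_href(value))
```

With these assumptions, resolve_href(" g ") gives "http://a/b/c/g", resolve_href("g\n  o") gives "http://a/b/c/g%20%20o", and resolve_href("g\n\to") gives "http://a/b/c/go".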
>
>
> RFC3986 does not define a single standard for converting an
> arbitrary string reference into a standard URL.  It does not do so
> because those rules have (in the past) differed
> based on context, such as the differences in algorithms for
> <a href>, <form>, <img src>, and the Location dialog/bar on
> GUI-based browsers.  There is even less commonality among reference
> algorithms across different data formats (RFC3986 defines URLs
> for the entire Internet, not just HTML). It has been assumed
> that individual data formats, like HTML, will define their own
> algorithms for converting reference strings, using something
> like the regular expression in the appendix to split the
> arbitrary string into the syntax components.
>
> That conversion algorithm has to take into account subjects that
> 3986 doesn't even attempt to address, like the document character
> encoding scheme, surrounding and embedded whitespace, and how to
> compose a query component.  Defining those things in HTML5 is not
> a willful violation of RFC3986, for the same reason that converting
> HTML character entities before processing them isn't a violation;
> it is simply preprocessing the supplied data in order to form the
> URL.
>
> What would be a violation of RFC3986 is if a UA were to send an
> arbitrary reference string, without pct-encoding the invalid
> characters and resolving it relative to the base, in a protocol
> element that expects a valid URL (e.g., the request target of
> an HTTP request).
>
> A better algorithm for Resolving References would accurately
> describe how embedded whitespace is stripped or replaced with
> a single pct-encoded space, components split using a regular
> expression (as in RFC3986), and non-URL characters processed in
> a component-specific way, to produce the URL that is used for
> fetching and for the URL decomposition IDL attributes.
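
The "split, then process each component in its own way" step might look like the sketch below. The split uses the RFC 3986 Appendix B regular expression. The per-component policy (UTF-8 for path and fragment, the document's character encoding for the query) is an assumption made for illustration and is not stated in the message above. The names encode_components and doc_encoding are hypothetical.

```python
import re
from urllib.parse import quote

# Regular expression from RFC 3986, Appendix B.
RFC3986_APPENDIX_B = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

# Characters left as-is; "%" is kept so existing pct-encoded triplets survive
# (a simplification: a stray "%" is also left alone).
KEEP = "%!$&'()*+,;=:@/?"


def encode_components(reference, doc_encoding="utf-8"):
    """Split a reference and pct-encode non-URL characters per component."""
    m = RFC3986_APPENDIX_B.match(reference)
    scheme, authority, path, query, fragment = m.group(2, 4, 5, 7, 9)
    out = ""
    if scheme is not None:
        out += scheme + ":"
    if authority is not None:
        out += "//" + authority  # authority is passed through untouched here
    out += quote(path, safe=KEEP)  # assumed: path bytes are UTF-8
    if query is not None:
        # assumed: query bytes use the document's character encoding
        out += "?" + quote(query, safe=KEEP, encoding=doc_encoding)
    if fragment is not None:
        out += "#" + quote(fragment, safe=KEEP)  # assumed: UTF-8
    return out
```

Under these assumptions, encode_components("c/é?q=é", "latin-1") returns "c/%C3%A9?q=%E9": the same character is encoded differently depending on the component it appears in.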
>
> Instead, the specification takes on a bizarre "Us vs The Man"
> attitude about 3986 (a standard for protocol elements SENT),
> redefines URL as an INPUT reference, converts that "URL" in
> place to absolute and pct-encoded form and calls that result
> the "URL", and then makes requests on the "URL" for the URL,
> sometimes in the same sentence (e.g., "When a URL is to be
> fetched, the URL identifies a resource to be obtained.").
>
> I don't care if the WG insists on using the acronym URL instead
> of URI -- they are defined to be equivalent in 3986.  I do
> care that the HTML5 spec is defining the input to its
> preprocessing as a URL and the output of its preprocessing
> as a URL, since that is both confusing and inaccurate.
>
> In my opinion, these inconsistencies should be fixed before
> HTML5 is advanced to CR.  These sections cause more damage
> than the benefit gained over simply referencing RFC3986.
> I am aware of Anne's work -- it does not seem intended to
> fix any of these inconsistencies and is not currently on
> track for HTML5.
>
> If not fixed, then these sections should be removed from HTML5
> and replaced with forward-looking definitions to be defined by
> later extension specs.  For example, define a "Web reference"
> as an arbitrary string that is to be transformed into an
> absolute URL reference, from which the IDL attribute values
> are obtained and the activated actions are targeted.  RFC3986's
> algorithms are sufficient to define the components that
> make up the IDL attributes.
>
> If the WG decides to advance the HTML5 specification to CR
> without fixing these errors and inconsistencies, then please
> consider this a formal objection.
>
>
> Cheers,
>
> Roy T. Fielding                 <http://roy.gbiv.com/>
> Sr. Principal Scientist, Adobe  <http://adobe.com/>
>
Received on Tuesday, 30 September 2014 01:32:43 UTC