- From: Sam Ruby <rubys@intertwingly.net>
- Date: Mon, 29 Sep 2014 21:32:13 -0400
- To: "Roy T. Fielding" <fielding@gbiv.com>
- CC: HTML WG <public-html-admin@w3.org>
I've taken a look at the recent snapshots of the URL specification.
Scheme no longer contains a colon, "authority-based URL" and
"hierarchical URL" are no longer defined, and I don't see a section
entitled "URL manipulation and creation".

Would you agree that what remains are nomenclature differences and the
fact that this specification doesn't match observed behavior of existing
user agents?

- Sam Ruby

On 11/25/2012 06:18 PM, Roy T. Fielding wrote:
> On Nov 14, 2012, at 2:15 PM, Sam Ruby wrote:
>
>> In accordance with both the W3C process's requirement to record the
>> group's decision to request advancement[1], and with the steps
>> identified in the "Plan 2014" CfC[2], this is a Call for Consensus
>> (CfC) to request transition to CR for the following document:
>>
>> http://htmlwg.org/cr/html/index.html
>>
>> Silence will be taken to mean there is no objection, but positive
>> responses are encouraged. If there are no objections by Monday,
>> November 26th, this resolution will carry.
>>
>> Considerations to note:
>>
>> - A request to advance indicates that the Working Group believes the
>>   specification is stable and appropriate for implementation.
>>
>> - The specification MAY still change based on implementation
>>   experience.
>>
>> - Sam Ruby, on behalf of the W3C co-chairs
>>
>> [1] http://www.w3.org/2005/10/Process-20051014/tr.html#transition-reqs
>> [2] http://lists.w3.org/Archives/Public/public-html/2012Oct/0026.html
>
>
> This specification continues to use terminology and definitions
> that are arbitrarily different from the other specifications of
> Web architecture, resulting in needless argumentation in support
> of willful violations that are really just a failure to use the
> right terms at the right times.
>
>    URL --> reference
>    resource --> representation
>    encoding --> charset (or character encoding scheme)
>
> The section on URLs
>
>    http://htmlwg.org/cr/html/urls.html
>
> is particularly egregious since it redefines URL to be a reference
> and then modifies the ABNF of RFC3986 in order to parse and
> resolve a URL (== reference) to an absolute URL, reversing the
> incorrect terms in order to invoke the other specification's
> algorithms. AFAICT, the rest of the HTML5 specification does not
> need to use the term URL except as part of a defined phrase, such
> as "valid URL potentially surrounded by spaces".
>
> In fact, the places where the defined phrase is used do allow
> any string as input (a reference) and do not perform any sort
> of validation on that input. They also don't treat arbitrary
> whitespace characters verbatim, as described in the algorithm.
> The places where validation is relevant are the DOM setters,
> which define their own conversion algorithms specific to each
> component.
>
> Where the 3986 parsed components are used, technical errors
> have been introduced for more unnecessary definitions. E.g.,
> sec 2.6.4:
>
>    An absolute URL is a hierarchical URL if, when resolved and
>    then parsed, there is a character immediately after the
>    <scheme> component and it is a "/" (U+002F) character.
>
>    An absolute URL is an authority-based URL if, when resolved
>    and then parsed, there are two characters immediately after
>    the <scheme> component and they are both "//" (U+002F)
>    characters.
>
> In both cases, the character immediately after the <scheme>
> component is a colon (":"), because the colon is a separator
> and not part of the component. That is, unlike the DOM
> attribute "protocol", which (due to some ancient bug) has a
> getter that appends the ":" to a scheme.
>
> It would be correct to say that an absolute URL is
> authority-based if, after parsing, the authority component
> is defined.
> Likewise, a URL is hierarchical if it is
> authority-based or the pathname begins with "/". However,
> I doubt that the specification needs either of these terms;
> if they are used somewhere, then define them where they are used.
>
> And in 2.6.7
>
>    o.protocol [ = value ]
>
>       Returns the current scheme of the underlying URL.
>
>       Can be set, to change the underlying URL's scheme.
>
> is likewise incorrect because it returns the URL scheme and ":".
> I am not sure what happens when it is set, with or without a ":".
>
> Sec 2.6.5 apparently defines an incorrect fragment-escape
> algorithm within a section entitled "URL manipulation and
> creation". I am not sure how that is supposed to be used.
> If it is just for fragments, then the section title should
> be corrected and the algorithm changed to avoid double-encoding
> an existing pct-encoded sequence. If it is for any URL component,
> then delete the section because it is hopelessly wrong.
>
> I also find it curious that the spec defines the meaning and
> attributes of the anchor (a) element within the section on Link
> (without an actual section xref). It should say somewhere that
> the a element's href attribute contains a reference (not a URL)
> and that the a element's href DOM property is the output of
> transcoding and resolving that reference to absolute URL
> form (including a fragment, if any).
>
> Instead, it says in
>
>    http://htmlwg.org/cr/html/the-a-element.html#the-a-element
>
>    The IDL attributes href, target, rel, media, hreflang, and type,
>    must reflect the respective content attributes of the same name.
>
> which is not consistent with how the href DOM attribute is
> implemented in Firefox and Chrome (other UAs not yet tested).
>
> Examples of how embedded whitespace is treated can be seen
> by looking at the href DOM property's result of references
> like:
>
> <PRE>
> <a href=" g "> g (leading and trailing)</a> = http://a/b/c/g
>
> <a href="g o">g o (embedded)</a> = http://a/b/c/g%20o
>
> <a href="g
> o">g o (embedded linefeed)</a> = http://a/b/c/go
>
> <a href="g
> o">g o (embedded space linefeed space)</a> = http://a/b/c/g%20o
>
> <a href="g
> o">g o (embedded linefeed and space)</a> = http://a/b/c/g%20o
>
> <a href="g
> o">g o (embedded linefeed and 2 spaces)</a> = http://a/b/c/g%20%20o
>
> <a href="g o">g o (embedded tab)</a> = http://a/b/c/go
>
> <a href="g
> o">g o (embedded linefeed and tab)</a> = http://a/b/c/go
>
> <a href="g
> o">g o (embedded linefeed space tab)</a> = http://a/b/c/g%20o
>
> <a href="g
> o">g o (embedded space linefeed tab)</a> = http://a/b/c/g%20o
> </PRE>
>
> In other words, this would suggest that linefeeds and HTAB
> characters are ignored, along with leading and trailing SP
> characters, but each embedded SP is replaced with a %20.
> This was tested on Chrome, so other UAs might differ,
> and I have only been testing <a href>, not the other contexts
> where references are used in HTML.
>
>
> RFC3986 does not define a single standard for converting an
> arbitrary string reference into a standard URL. The reason it
> does not do so is because those rules have (in the past) differed
> based on context, such as the differences in algorithms for
> <a href>, <form>, <img src>, and the Location dialog/bar on
> GUI-based browsers. There is even less commonality among reference
> algorithms across different data formats (RFC3986 defines URLs
> for the entire Internet, not just HTML). It has been assumed
> that individual data formats, like HTML, will define their own
> algorithms for converting reference strings, using something
> like the regular expression in the appendix to split the
> arbitrary string into the syntax components.
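The whitespace handling summarized above, followed by a split with the RFC3986 Appendix B regular expression, can be sketched roughly as follows. This is only an illustration of the summary rule as stated (drop LF and HTAB, trim leading and trailing SP, turn each remaining SP into %20), not a description of what any UA actually implements. The function names are hypothetical, and the base URL is an assumption: it is the example base from RFC3986 section 5.4, which is consistent with the results listed but is not stated in the message.

```python
import re
from urllib.parse import urljoin

# Regular expression from RFC 3986, Appendix B.
RFC3986_SPLIT = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

# Assumption: the example base URL of RFC 3986 section 5.4.
ASSUMED_BASE = "http://a/b/c/d;p?q"


def preprocess_reference(raw: str) -> str:
    """Apply the observed <a href> whitespace rule to a raw reference."""
    # Linefeeds and horizontal tabs are ignored entirely.
    without_controls = raw.replace("\n", "").replace("\t", "")
    # Leading and trailing spaces are dropped.
    trimmed = without_controls.strip(" ")
    # Each remaining (embedded) space becomes a pct-encoded space.
    return trimmed.replace(" ", "%20")


def split_components(reference: str) -> dict:
    """Split a reference into the five RFC 3986 syntax components."""
    m = RFC3986_SPLIT.match(reference)  # every group is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


def resolve(raw: str, base: str = ASSUMED_BASE) -> str:
    """Preprocess a raw reference, then resolve it against the base."""
    return urljoin(base, preprocess_reference(raw))


if __name__ == "__main__":
    for raw in [" g ", "g o", "g\no", "g\to", "g\n  o"]:
        print(repr(raw), "->", resolve(raw))
```

Keeping the preprocessing separate from resolution mirrors the point being made: the cleanup is an HTML-level step applied to the reference, and only its output is handed to the RFC3986 algorithms.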
>
> That conversion algorithm has to take into account subjects that
> 3986 doesn't even attempt to address, like the document character
> encoding scheme, surrounding and embedded whitespace, and how to
> compose a query component. Defining those things in HTML5 is not
> a willful violation of RFC3986, for the same reason that converting
> HTML character entities before processing them isn't a violation;
> it is simply preprocessing the supplied data in order to form the
> URL.
>
> What would be a violation of RFC3986 is if a UA were to send an
> arbitrary reference string, without pct-encoding the invalid
> characters and resolving it relative to the base, in a protocol
> element that expects a valid URL (e.g., the request target of
> an HTTP request).
>
> A better algorithm for Resolving References would accurately
> describe how embedded whitespace is stripped or replaced with
> a single pct-encoded space, components split using a regular
> expression (as in RFC3986), and non-URL characters processed in
> a component-specific way, to produce the URL that is used for
> fetching and for the URL decomposition IDL attributes.
>
> Instead, the specification takes on a bizarre "Us vs The Man"
> attitude about 3986 (a standard for protocol elements SENT),
> redefines URL as an INPUT reference, converts that "URL" in
> place to absolute and pct-encoded form and calls that result
> the "URL", and then makes requests on the "URL" for the URL,
> sometimes in the same sentence (e.g., "When a URL is to be
> fetched, the URL identifies a resource to be obtained.").
>
> I don't care if the WG insists on using the acronym URL instead
> of URI -- they are defined to be equivalent in 3986. I do
> care that the HTML5 spec is defining the input to its
> preprocessing as a URL and the output to its preprocessing
> as a URL, since that is both confusing and inaccurate.
>
> In my opinion, these inconsistencies should be fixed before
> HTML5 is advanced to CR.
> These sections cause more damage
> than the benefit gained over simply referencing RFC3986.
> I am aware of Anne's work -- it does not seem intended to
> fix any of these inconsistencies and is not currently on
> track for HTML5.
>
> If not fixed, then these sections should be removed from HTML5
> and replaced with forward looking definitions to be defined by
> later extension specs. For example, define a "Web reference"
> as an arbitrary string that is to be transformed into an
> absolute URL reference, from which the IDL attribute values
> are obtained and the activated actions are targeted. RFC3986's
> algorithms are sufficient to define the components that
> make up the IDL attributes.
>
> If the WG decides to advance the HTML5 specification to CR
> without fixing these errors and inconsistencies, then please
> consider this a formal objection.
>
>
> Cheers,
>
> Roy T. Fielding <http://roy.gbiv.com/>
> Sr. Principal Scientist, Adobe <http://adobe.com/>
>
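For reference, the corrected definitions suggested in the quoted message (authority-based if the authority component is defined, hierarchical if authority-based or the path begins with "/") can be expressed directly on top of the RFC3986 Appendix B components. The following is a minimal sketch under that reading; the function names are hypothetical, and raising an error for a string without a scheme is my own assumption, since the message only discusses absolute URLs.

```python
import re

# Regular expression from RFC 3986, Appendix B.
RFC3986_SPLIT = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def is_authority_based(url: str) -> bool:
    """True if the authority component is defined (it may be empty)."""
    m = RFC3986_SPLIT.match(url)
    # Group 3 is the "//" plus authority; it is None when no authority is present.
    return m.group(3) is not None


def is_hierarchical(url: str) -> bool:
    """True if the URL is authority-based or its path begins with "/"."""
    m = RFC3986_SPLIT.match(url)
    return m.group(3) is not None or m.group(5).startswith("/")


def protocol_of(url: str) -> str:
    """Scheme followed by ":", as the DOM "protocol" getter is described."""
    scheme = RFC3986_SPLIT.match(url).group(2)
    if scheme is None:
        # Assumption: only absolute URLs are handled in this sketch.
        raise ValueError("not an absolute URL: %r" % url)
    # The colon is a separator, not part of the scheme component itself.
    return scheme + ":"


if __name__ == "__main__":
    for url in ["http://a/b/c/g", "file:///etc/hosts", "file:/etc/hosts",
                "mailto:user@example.com"]:
        print(url, is_authority_based(url), is_hierarchical(url), protocol_of(url))
```

Testing group 3 rather than the authority text keeps "file:///etc/hosts" authority-based: its authority is defined but empty, which a check on the authority string alone would miss.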
Received on Tuesday, 30 September 2014 01:32:43 UTC