- From: Roy T. Fielding <fielding@gbiv.com>
- Date: Sun, 25 Nov 2012 15:18:15 -0800
- To: Sam Ruby <rubys@intertwingly.net>
- Cc: HTML WG <public-html@w3.org>
On Nov 14, 2012, at 2:15 PM, Sam Ruby wrote:

> In accordance with both the W3C process's requirement to record the group's decision to request advancement[1], and with the steps identified in the "Plan 2014" CfC[2], this is a Call for Consensus (CfC) to request transition to CR for the following document:
>
> http://htmlwg.org/cr/html/index.html
>
> Silence will be taken to mean there is no objection, but positive
> responses are encouraged. If there are no objections by Monday,
> November 26th, this resolution will carry.
>
> Considerations to note:
>
> - A request to advance indicates that the Working Group believes the specification is stable and appropriate for implementation.
>
> - The specification MAY still change based on implementation experience.
>
> - Sam Ruby, on behalf of the W3C co-chairs
>
> [1] http://www.w3.org/2005/10/Process-20051014/tr.html#transition-reqs
> [2] http://lists.w3.org/Archives/Public/public-html/2012Oct/0026.html

This specification continues to use terminology and definitions that are arbitrarily different from the other specifications of Web architecture, resulting in needless argumentation in support of willful violations that are really just a failure to use the right terms at the right times.

   URL --> reference
   resource --> representation
   encoding --> charset (or character encoding scheme)

The section on URLs

   http://htmlwg.org/cr/html/urls.html

is particularly egregious since it redefines URL to be a reference and then modifies the ABNF of RFC3986 in order to parse and resolve a URL (== reference) to an absolute URL, reversing the incorrect terms in order to invoke the other specification's algorithms.

AFAICT, the rest of the HTML5 specification does not need to use the term URL except as part of a defined phrase, such as "valid URL potentially surrounded by spaces". In fact, the places where the defined phrase is used do allow any string as input (a reference) and do not perform any sort of validation on that input.
They also don't treat arbitrary whitespace characters verbatim, as described in the algorithm. The places where validation is relevant are the DOM setters, which define their own conversion algorithms specific to each component.

Where the 3986 parsed components are used, technical errors have been introduced for more unnecessary definitions. E.g., sec 2.6.4:

   An absolute URL is a hierarchical URL if, when resolved and then parsed, there is a character immediately after the <scheme> component and it is a "/" (U+002F) character.

   An absolute URL is an authority-based URL if, when resolved and then parsed, there are two characters immediately after the <scheme> component and they are both "//" (U+002F) characters.

In both cases, the character immediately after the <scheme> component is a colon (":"), because the colon is a separator and not part of the component. That is, unlike the DOM attribute "protocol", which (due to some ancient bug) has a getter that appends the ":" to a scheme.

It would be correct to say that an absolute URL is authority-based if, after parsing, the authority component is defined. Likewise, a URL is hierarchical if it is authority-based or the pathname begins with "/". However, I doubt that the specification needs either of these terms; if they are used somewhere, then define them where they are used.

And in 2.6.7

   o.protocol [ = value ]
      Returns the current scheme of the underlying URL.
      Can be set, to change the underlying URL's scheme.

is likewise incorrect because it returns the URL scheme and ":". I am not sure what happens when it is set, with or without a ":".

Sec 2.6.5 apparently defines an incorrect fragment-escape algorithm within a section entitled "URL manipulation and creation". I am not sure how that is supposed to be used. If it is just for fragments, then the section title should be corrected and the algorithm changed to avoid double-encoding an existing pct-encoded sequence.
If it is for any URL component, then delete the section because it is hopelessly wrong.

I also find it curious that the spec defines the meaning and attributes of the anchor (a) element within the section on Link (without an actual section xref). It should say somewhere that the a element's href attribute contains a reference (not a URL) and that the a element's href DOM property is the output of transcoding and resolving that reference to absolute URL form (including a fragment, if any). Instead, it says in

   http://htmlwg.org/cr/html/the-a-element.html#the-a-element

   The IDL attributes href, target, rel, media, hreflang, and type, must reflect the respective content attributes of the same name.

which is not consistent with how the href DOM attribute is implemented in Firefox and Chrome (other UAs not yet tested). Examples of how embedded whitespace is treated can be seen by looking at the href DOM property's result of references like:

<PRE>
<a href=" g "> g (leading and trailing)</a> = http://a/b/c/g
<a href="g o">g o (embedded)</a> = http://a/b/c/g%20o
<a href="g o">g o (embedded linefeed)</a> = http://a/b/c/go
<a href="g o">g o (embedded space linefeed space)</a> = http://a/b/c/g%20o
<a href="g o">g o (embedded linefeed and space)</a> = http://a/b/c/g%20o
<a href="g o">g o (embedded linefeed and 2 spaces)</a> = http://a/b/c/g%20%20o
<a href="g o">g o (embedded tab)</a> = http://a/b/c/go
<a href="g o">g o (embedded linefeed and tab)</a> = http://a/b/c/go
<a href="g o">g o (embedded linefeed space tab)</a> = http://a/b/c/g%20o
<a href="g o">g o (embedded space linefeed tab)</a> = http://a/b/c/g%20o
</PRE>

In other words, this would suggest that linefeeds and HTAB characters are ignored, along with leading and trailing SP characters, but each embedded SP is replaced with a %20. This was tested on Chrome, so other UAs might differ, and I have only been testing <a href>, not the other contexts where references are used in HTML.
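
As an illustration only, the rule summarized in the preceding paragraph can be sketched in Python. This is not any browser's actual implementation and not text from the specification. The function names `preprocess_reference` and `resolve` are hypothetical, and the base URL is an assumption chosen because it yields the http://a/b/c/ prefix seen in the results above. The sketch models only the stated summary rule and does not claim to reproduce every row of the table.

```python
from urllib.parse import urljoin

# Assumed base URL; chosen only because it produces the "http://a/b/c/" prefix.
BASE = "http://a/b/c/d;p?q"


def preprocess_reference(raw: str) -> str:
    """Apply the whitespace handling summarized above to an href value."""
    # Linefeeds and horizontal tabs are dropped wherever they occur.
    cleaned = raw.replace("\n", "").replace("\t", "")
    # Leading and trailing SP characters are removed.
    cleaned = cleaned.strip(" ")
    # Each remaining (embedded) SP becomes a pct-encoded space.
    return cleaned.replace(" ", "%20")


def resolve(raw: str, base: str = BASE) -> str:
    """Preprocess the reference, then resolve it against the base URL."""
    return urljoin(base, preprocess_reference(raw))


if __name__ == "__main__":
    for href in [" g ", "g o", "g\no", "g\n  o", "g\to"]:
        print(repr(href), "=", resolve(href))
```

The point of the sketch is the ordering: the whitespace preprocessing happens on the reference string first, and only its output is handed to reference resolution.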
RFC3986 does not define a single standard for converting an arbitrary string reference into a standard URL. The reason it does not do so is because those rules have (in the past) differed based on context, such as the differences in algorithms for <a href>, <form>, <img src>, and the Location dialog/bar on GUI-based browsers. There is even less commonality among reference algorithms across different data formats (RFC3986 defines URLs for the entire Internet, not just HTML).

It has been assumed that individual data formats, like HTML, will define their own algorithms for converting reference strings, using something like the regular expression in the appendix to split the arbitrary string into the syntax components. That conversion algorithm has to take into account subjects that 3986 doesn't even attempt to address, like the document character encoding scheme, surrounding and embedded whitespace, and how to compose a query component.

Defining those things in HTML5 is not a willful violation of RFC3986, for the same reason that converting HTML character entities before processing them isn't a violation; it is simply preprocessing the supplied data in order to form the URL. What would be a violation of RFC3986 is if a UA were to send an arbitrary reference string, without pct-encoding the invalid characters and resolving it relative to the base, in a protocol element that expects a valid URL (e.g., the request target of an HTTP request).

A better algorithm for Resolving References would accurately describe how embedded whitespace is stripped or replaced with a single pct-encoded space, components split using a regular expression (as in RFC3986), and non-URL characters processed in a component-specific way, to produce the URL that is used for fetching and for the URL decomposition IDL attributes.
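
For illustration, the component split mentioned above can be sketched in Python using the regular expression from RFC3986, Appendix B. The helper names `split_components`, `is_authority_based`, and `is_hierarchical` are hypothetical; the two predicates follow the corrected definitions suggested earlier in this message (authority component defined; authority-based or path beginning with "/"). This is a sketch under those assumptions, not proposed specification text.

```python
import re

# Regular expression from RFC 3986, Appendix B.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_components(reference: str) -> dict:
    """Split a reference into the five RFC 3986 syntax components.

    A component that is absent is reported as None; the path is always
    a string (possibly empty). Separators such as ":" are not included.
    """
    m = URI_REFERENCE.match(reference)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


def is_authority_based(reference: str) -> bool:
    # Authority-based: after parsing, the authority component is defined.
    return split_components(reference)["authority"] is not None


def is_hierarchical(reference: str) -> bool:
    # Hierarchical: authority-based, or the path begins with "/".
    parts = split_components(reference)
    return parts["authority"] is not None or parts["path"].startswith("/")


if __name__ == "__main__":
    print(split_components("http://a/b/c/g%20o?x=1#frag"))
    print(is_authority_based("mailto:user@example.com"))
```

Note that the scheme component comes back without the trailing ":", since the colon is a separator and not part of the component.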
Instead, the specification takes on a bizarre "Us vs The Man" attitude about 3986 (a standard for protocol elements SENT), redefines URL as an INPUT reference, converts that "URL" in place to absolute and pct-encoded form and calls that result the "URL", and then makes requests on the "URL" for the URL, sometimes in the same sentence (e.g., "When a URL is to be fetched, the URL identifies a resource to be obtained.").

I don't care if the WG insists on using the acronym URL instead of URI -- they are defined to be equivalent in 3986. I do care that the HTML5 spec is defining the input to its preprocessing as a URL and the output to its preprocessing as a URL, since that is both confusing and inaccurate.

In my opinion, these inconsistencies should be fixed before HTML5 is advanced to CR. These sections cause more damage than the benefit gained over simply referencing RFC3986. I am aware of Anne's work -- it does not seem intended to fix any of these inconsistencies and is not currently on track for HTML5.

If not fixed, then these sections should be removed from HTML5 and replaced with forward looking definitions to be defined by later extension specs. For example, define a "Web reference" as an arbitrary string that is to be transformed into an absolute URL reference, from which the IDL attribute values are obtained and the activated actions are targeted. RFC3986's algorithms are sufficient to define the components that make up the IDL attributes.

If the WG decides to advance the HTML5 specification to CR without fixing these errors and inconsistencies, then please consider this a formal objection.

Cheers,

Roy T. Fielding <http://roy.gbiv.com/>
Sr. Principal Scientist, Adobe <http://adobe.com/>
Received on Sunday, 25 November 2012 23:18:39 UTC