- From: Philip TAYLOR <P.Taylor@Rhul.Ac.Uk>
- Date: Tue, 18 Nov 2008 17:21:27 +0000
- To: Ian Hickson <ian@hixie.ch>
- CC: Elliotte Harold <elharo@metalab.unc.edu>, "Henry S. Thompson" <ht@inf.ed.ac.uk>, noah_mendelsohn@us.ibm.com, public-html <public-html@w3.org>, www-tag@w3.org
Ian Hickson wrote:

> How is this different from what HTML4 did? HTML4 said "this is what is
> valid, and everything else should work too".

and later clarified this by citing various sections of the HTML 4.01 specification :

>> 6.16 paragraph 1 sentence 2,
>> section 7.2 second bullet point of the first note,
>> last paragraph of section 9.3.1,
>> last paragraph of the definition of the "span" attribute in section 11.2.4,
>> third paragraph of section 16.2,
>> definition of the "checked" attribute in section 17.4,
>> and, probably most importantly, section B.1.
>> To name but a few.

So let's see how the cited evidence supports the claim, remembering that the claim is that "/everything/ else should work too", not that "some things that are not explicitly specified should work too".

>> 6.16 paragraph 1 sentence 2,

"User agents should ignore all other target names."
In other words, target names that are not explicitly mentioned in the spec. should be ignored.

>> section 7.2 second bullet point of the first note,

"Software conforming to the DTDs of the present specification may ignore features of future HTML 4 DTDs that it does not recognize."
IOW, a browser that can read and process DTDs may safely ignore features specified in DTDs more recent than the DTD to which the specification refers.

>> last paragraph of section 9.3.1,

"We discourage authors from using empty P elements. User agents should ignore empty P elements."
IOW, empty P elements are legal and should be ignored.

>> last paragraph of the definition of the "span" attribute in section 11.2.4,

"User agents must ignore this attribute if the COLGROUP element contains one or more COL elements."
IOW, there exists at least one context in which the SPAN attribute should be ignored.

>> third paragraph of section 16.2,

"Elements that might normally be placed in the BODY element must not appear before the first FRAMESET element or the FRAMESET will be ignored."
IOW, FRAMESET must precede any other elements that would normally be placed in the BODY element; if this constraint is violated, the FRAMESET will be ignored.

>> definition of the "checked" attribute in section 17.4,

"checked [CI] [p.49] When the type attribute has the value "radio" or "checkbox", this boolean attribute specifies that the button is on. User agents must ignore this attribute for other control types."
IOW, if the attribute "checked" is used in contexts where the corresponding "type" attribute has neither the value "radio" nor "checkbox", the "checked" attribute must be ignored.

>> and, probably most importantly, section B.1.

[Note : this is a long extract -- I hope that reproducing it here does not violate W3C copyright -- my comments in square brackets]

"B.1 Notes on invalid documents

This specification does not define how conforming user agents handle general error conditions, including how user agents behave when they encounter elements, attributes, attribute values, or entities not specified in this document.

However, to facilitate experimentation and interoperability between implementations of various versions of HTML, we recommend the following behavior:

If a user agent encounters an element it does not recognize, it should try to render the element’s content.
[Unrecognised elements should be rendered as if the surrounding tags were absent]

If a user agent encounters an attribute it does not recognize, it should ignore the entire attribute specification (i.e., the attribute and its value).
[Unrecognised attributes must be ignored along with their values]

If a user agent encounters an attribute value it doesn’t recognize, it should use the default attribute value.
[Unrecognised attribute values should be replaced by the default value for the attribute]

If it encounters an undeclared entity, the entity should be treated as character data.
[Unrecognised entities should be treated as literal strings]

We also recommend that user agents provide support for notifying the user of such errors.
[It should be possible for a user to be informed of such errors by the UA]

Since user agents may vary in how they handle error conditions, authors and users must not rely on specific error recovery behavior.
[Error handling is unspecified; users may therefore not rely on consistent error-handling behavior]

The HTML 2.0 specification ([RFC1866] [p.356] ) observes that many HTML 2.0 user agents assume that a document that does not begin with a document type declaration refers to the HTML 2.0 specification. As experience shows that this is a poor assumption, the current specification does not recommend this behavior.
[A document that lacks a DOCTYPE must not be assumed to conform to the HTML 2.0 specification]

For reasons of interoperability, authors must not "extend" HTML through the available SGML mechanisms (e.g., extending the DTD, adding a new set of entity definitions, etc.)."
[No gloss needed]

So, to summarise, the specification lists a small number of possible scenarios in each of which the browser is expected to perform a "best efforts" recovery.

Now consider the following extract, based on ideas that have already been discussed in this thread :

<b>I am bold<i>I am italic</b>I am not bold</i>I am not italic

What does the HTML 4.01 specification say concerning the rendering of this fragment ? Nothing at all. The fragment is invalid, and there is therefore no discussion of how it should be parsed or rendered. Indeed, nowhere is there a recommendation that it should be rendered at all.

Yet we have already learned that at least one web logging engine (that on which "blogs.sun.com" is based) outputs analogous markup : <code><pre>{ ... }</code></pre>.
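To make concrete what "invalid" means for the two fragments above, here is a minimal sketch that reports every end tag which does not close the innermost open element. It uses only Python's standard-library html.parser; the names NestingChecker and misnested_end_tags are hypothetical, void elements such as <br> are not handled, and this is neither the HTML 4.01 validity check nor the HTML 5 parsing algorithm, merely a stack-based nesting test.

```python
from html.parser import HTMLParser


class NestingChecker(HTMLParser):
    """Hypothetical helper: records end tags that do not match the innermost open element."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # stack of currently open tag names
        self.mismatches = []     # (expected, found) pairs, in document order

    def handle_starttag(self, tag, attrs):
        self.open_elements.append(tag)

    def handle_endtag(self, tag):
        expected = self.open_elements[-1] if self.open_elements else None
        if tag == expected:
            self.open_elements.pop()
            return
        # Misnested (or stray) end tag: record it, then drop the named
        # element from the stack if it is open, so checking can continue.
        self.mismatches.append((expected, tag))
        if tag in self.open_elements:
            self.open_elements.remove(tag)


def misnested_end_tags(fragment):
    """Return a list of (expected, found) pairs for badly nested end tags."""
    checker = NestingChecker()
    checker.feed(fragment)
    checker.close()
    return checker.mismatches


if __name__ == "__main__":
    print(misnested_end_tags(
        "<b>I am bold<i>I am italic</b>I am not bold</i>I am not italic"))
    print(misnested_end_tags("<code><pre>{ ... }</code></pre>"))
```

In both fragments the first end tag names the outer element while the inner one is still open, so each yields exactly one mismatch; a properly nested fragment such as <b><i>x</i></b> yields none.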
And a report from my namesake Philip Taylor at Cambridge states that :

> HTML5 (or at least html5lib and validator.nu) currently parses
>
> A<code><pre>B</code></pre>C
>
> into
>
> | "A"
> | <code>
> | <pre>
> | "B"
> | "C"

Thus there is more than a tacit suggestion that a conforming HTML 5 browser /will/ render "A<code><pre>B</code></pre>C" according to some well-defined set of heuristics. And this is the point at which I feel that HTML 5 (unlike HTML 4) is totally over-stepping the mark, by defining a rendering for a code fragment that is completely and utterly ill-formed.

Enough. I rest my case.

Philip TAYLOR
Received on Tuesday, 18 November 2008 17:22:40 UTC