Re: Comments on HTML WG face to face meetings in France Oct 08 from Philip TAYLOR on 2008-11-18 (www-tag@w3.org from November 2008)

From: Philip TAYLOR <P.Taylor@Rhul.Ac.Uk>
Date: Tue, 18 Nov 2008 17:21:27 +0000
To: Ian Hickson <ian@hixie.ch>
CC: Elliotte Harold <elharo@metalab.unc.edu>, "Henry S. Thompson" <ht@inf.ed.ac.uk>, noah_mendelsohn@us.ibm.com, public-html <public-html@w3.org>, www-tag@w3.org
Message-ID: <4922F997.60004@Rhul.Ac.Uk>

Ian Hickson wrote:

> How is this different from what HTML4 did? HTML4 said "this is what is 
> valid, and everything else should work too". 

and later clarified this by citing various sections of the
HTML 4.01 specification :

>> 6.16 paragraph 1 sentence 2, 
>> section 7.2 second bullet point of the first note, 
>> last paragraph of section 9.3.1, 
>> last paragraph of the definition of the "span" attribute in section 11.2.4, 
>> third paragraph of section 16.2, 
>> definition of the "checked" attribute in section 17.4, 
>> and, probably most importantly, section B.1. 

>> To name but a few.

So let's see how the cited evidence supports the claim, remembering
that the claim is that "/everything/ else should work too", not that
"some things that are not explicitly specified should work too".

>> 6.16 paragraph 1 sentence 2, 

"User agents should ignore all other target names."

In other words, target names that are not explicitly mentioned
in the spec. should be ignored.

>> section 7.2 second bullet point of the first note, 

"Software conforming to the DTDs of the present specification may
  ignore features of future HTML 4 DTDs that it does not recognize."

IOW, a browser that can read and process DTDs may safely ignore
features specified in DTDs more recent than the DTD to which the
specification refers.

>> last paragraph of section 9.3.1, 

"We discourage authors from using empty P elements.
  User agents should ignore empty P elements."

IOW, empty P elements are legal and should be ignored.

>> last paragraph of the definition of the "span" attribute in section 11.2.4, 

"User agents must ignore this attribute if the
  COLGROUP element contains one or more COL elements."

IOW, there exists at least one context in which the
SPAN attribute should be ignored.

>> third paragraph of section 16.2, 

"Elements that might normally be placed in the
  BODY element must not appear before the first
  FRAMESET element or the FRAMESET will be ignored."

IOW, FRAMESET must precede any other elements that
would normally be placed in the BODY element; if
this constraint is violated, the FRAMESET will be
ignored.

>> definition of the "checked" attribute in section 17.4, 

"checked [CI] [p.49]
  When the type attribute has the value "radio" or "checkbox",
  this boolean attribute specifies that the button is on.
  User agents must ignore this attribute for other control types."

IOW, if the attribute "checked" is used in contexts where
the corresponding "type" attribute has neither the value
"radio" nor "checkbox", the "checked" attribute must be
ignored.
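
To make the narrowness of that rule concrete, here is a minimal sketch
in Python (my own illustration, not anything taken from the
specification; the function name "checked_applies" is hypothetical).
It assumes only what the quoted text says, plus the [CI] marker
indicating that the attribute value is case-insensitive :

```python
def checked_applies(input_type, has_checked):
    """Return True if a "checked" attribute takes effect on this control.

    Per the sentence quoted from section 17.4, "checked" is honoured only
    when the type attribute is "radio" or "checkbox"; for every other
    control type it must be ignored.
    """
    # Compare case-insensitively (assumption based on the [CI] marker).
    return has_checked and input_type.lower() in {"radio", "checkbox"}
```

The rule covers exactly one attribute in exactly one context; it says
nothing about any other kind of misuse.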

>> and, probably most importantly, section B.1. 

[Note : this is a long extract -- I hope that reproducing
  it here does not violate W3C copyright -- my comments in
  square brackets]

"B.1 Notes on invalid documents

  This specification does not define how conforming user agents handle general error
  conditions, including how user agents behave when they encounter elements,
  attributes, attribute values, or entities not specified in this document.
  However, to facilitate experimentation and interoperability between
  implementations of various versions of HTML, we recommend the following
  behavior:

  If a user agent encounters an element it does not recognize, it should try to
  render the element’s content.

[Unrecognised elements should be rendered as if the surrounding tags were absent]

  If a user agent encounters an attribute it does not recognize, it should ignore the
  entire attribute specification (i.e., the attribute and its value).

[Unrecognised attributes should be ignored along with their values]

  If a user agent encounters an attribute value it doesn’t recognize, it should use
  the default attribute value.

[Unrecognised attribute values should be replaced by the default value for the attribute]

  If it encounters an undeclared entity, the entity should be treated as character
  data.

[Unrecognised entities should be treated as literal strings]

  We also recommend that user agents provide support for notifying the user of
  such errors.

[It should be possible for a user to be informed of such errors by the UA]

  Since user agents may vary in how they handle error conditions, authors and
  users must not rely on specific error recovery behavior.

[Error handling is unspecified; authors and users must therefore not rely
  on consistent error-handling behavior]

  The HTML 2.0 specification ([RFC1866] [p.356] ) observes that many HTML 2.0
  user agents assume that a document that does not begin with a document type
  declaration refers to the HTML 2.0 specification. As experience shows that this is a
  poor assumption, the current specification does not recommend this behavior.

[A document that lacks a DOCTYPE should not be assumed to conform to the HTML 2.0
  specification]

  For reasons of interoperability, authors must not "extend" HTML through the
  available SGML mechanisms (e.g., extending the DTD, adding a new set of entity
  definitions, etc.)."

[No gloss needed]
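
The four recovery recommendations in B.1 are specific enough to be
sketched in a few lines.  What follows is a rough illustration in
Python, built on the standard library's html.parser module; the class
name "B1Recovery", the helper "recover" and the tiny vocabulary tables
are all hypothetical, and a real user agent would of course know far
more elements, attributes and entities than these :

```python
from html.parser import HTMLParser

# Hypothetical, deliberately tiny vocabulary, for illustration only.
KNOWN_ELEMENTS = {"p", "b", "i", "input"}
KNOWN_ATTRIBUTES = {"input": {"type"}}
KNOWN_VALUES = {("input", "type"): {"text", "radio", "checkbox"}}
DEFAULT_VALUES = {("input", "type"): "text"}
KNOWN_ENTITIES = {"amp": "&", "lt": "<", "gt": ">"}


class B1Recovery(HTMLParser):
    """Applies the four B.1 recommendations and nothing more."""

    def __init__(self):
        # Keep entity references unconverted so that we see them ourselves.
        super().__init__(convert_charrefs=False)
        self.text = []        # character data that would be rendered
        self.attributes = []  # (element, attribute, value) triples retained

    def handle_starttag(self, tag, attrs):
        if tag not in KNOWN_ELEMENTS:
            # Unrecognised element: drop the tag; its content still
            # arrives through handle_data and is therefore rendered.
            return
        for name, value in attrs:
            if name not in KNOWN_ATTRIBUTES.get(tag, set()):
                continue  # unrecognised attribute: ignore it and its value
            allowed = KNOWN_VALUES.get((tag, name))
            if allowed is not None and value not in allowed:
                value = DEFAULT_VALUES[(tag, name)]  # fall back to default
            self.attributes.append((tag, name, value))

    def handle_data(self, data):
        self.text.append(data)

    def handle_entityref(self, name):
        # Undeclared entity: treat the reference as character data.
        self.text.append(KNOWN_ENTITIES.get(name, "&" + name + ";"))


def recover(markup):
    """Return (rendered text, retained attributes) for a markup string."""
    parser = B1Recovery()
    parser.feed(markup)
    parser.close()
    return "".join(parser.text), parser.attributes
```

Note what the sketch does not contain : any rule for tags that are
opened and closed in the wrong order.  B.1 offers none.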


So, to summarise, the specification lists a small number of possible
scenarios in each of which the browser is expected to perform
a "best efforts" recovery.  Now consider the following extract,
based on ideas that have already been discussed in this thread :

	<b>I am bold<i>I am italic</b>I am not bold</i>I am not italic

What does the HTML 4.01 specification say concerning the rendering of this fragment ?
Nothing at all.  The fragment is invalid, and there is therefore no
discussion of how it should be parsed or rendered.  Indeed, nowhere is
there a recommendation that it should be rendered at all.
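
For the avoidance of doubt as to why the fragment is invalid, the
mis-nesting can be detected mechanically.  The following is a minimal
sketch in Python (my own, using the standard html.parser module; the
names "NestingChecker" and "misnested_end_tags" are hypothetical).  It
assumes that every start tag has a matching end tag, so it is not
suitable for elements such as BR; and it merely reports the problem,
which is all that HTML 4.01 would lead one to expect :

```python
from html.parser import HTMLParser


class NestingChecker(HTMLParser):
    """Records end tags that do not close the most recently opened element."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements, innermost last
        self.errors = []  # end tags that arrived out of order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()         # properly nested: close innermost
        else:
            self.errors.append(tag)  # mis-nested: report, do not repair


def misnested_end_tags(fragment):
    """Return the end tags in the fragment that break proper nesting."""
    checker = NestingChecker()
    checker.feed(fragment)
    checker.close()
    return checker.errors


fragment = "<b>I am bold<i>I am italic</b>I am not bold</i>I am not italic"
print(misnested_end_tags(fragment))
```

The sketch stops at detection; it makes no attempt to decide which
text should be bold and which italic.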

Yet we have already learned that at least one web logging engine
(that on which "blogs.sun.com" is based) outputs analogous markup :
<code><pre>{  ... }</code></pre>.  And a report from my namesake
Philip Taylor at Cambridge states that :

> HTML5 (or at least html5lib and validator.nu) currently parses
> 
>   A<code><pre>B</code></pre>C
> 
> into
> 
>   |     "A"
>   |     <code>
>   |       <pre>
>   |         "B"
>   |       "C" 

Thus there is more than a tacit suggestion that a conforming
HTML 5 browser /will/ render "A<code><pre>B</code></pre>C"
according to some well-defined set of heuristics.

And this is the point at which I feel that HTML 5 (unlike HTML 4)
is totally over-stepping the mark, by defining a rendering for a
code fragment that is completely and utterly ill-formed.  Enough.
I rest my case.

Philip TAYLOR
Received on Tuesday, 18 November 2008 17:22:37 UTC