Other syntax: part of my review of 8 The HTML syntax from Robert Burns on 2007-08-14 (public-html@w3.org from August 2007)

From: Robert Burns <rob@robburns.com>
Date: Tue, 14 Aug 2007 05:14:56 -0500
To: HTMLWG WG <public-html@w3.org>
Message-Id: <54DFD15E-C8EA-445C-A674-8611405F77CD@robburns.com>
Summary
--------------------------------
  • Rework the HTML syntax to be about HTML in the broadest sense of  
the term
  • Use HTML to refer to the broadest sense of HTML and 'text/html'  
to refer to the serialization
  • Move the syntax chapter to chapter two (2).
  • Specify error-handling for the XML serialization with respect to
misused '&', '<' and unknown character entities

Broadening this chapter
--------------------------------
Echoing and adapting a suggestion made by Ben Boyle [1], I'd like to
see this chapter broadened to include not just the syntax for the  
text/html serialization, but HTML syntax in general. This could look  
something like:

8. The HTML syntax
8.1 Expressing HTML data types (for expressing booleans, numbers,
dates and times: especially where they are independent of syntax)
8.2 XML based HTML syntax
8.3 text/html based HTML syntax
8.3.1. Writing HTML documents
8.3.2. Parsing HTML documents in the text/html serialization
8.3.3. Namespaces in the text/html serialization (this might be  
generalizable too depending on how we handle the distributed  
extensibility issue[2])
8.4. Entities (which are a part of both serializations)

The order could be changed if we wanted to make the text/html  
serialization first; followed by the xml serialization; followed by  
the serialization independent syntax.

Use HTML in the broad sense: 'text/html' to name the serialization
--------------------------------
Make the term HTML throughout the draft apply to HTML in the broad  
sense. Use 'text/html' or another term to refer to the particular  
serialization of HTML. Use HTML DOM or just DOM in the draft to refer  
specifically to the DOM. Use XHTML or XML to refer to the XML  
serialization of HTML. HTML is so much more about that big picture  
than it is about any particular serialization of it. We could make a  
binary serialization of HTML and it would still be HTML (IMHO).

Moving this chapter earlier
--------------------------------
While this may not be a chapter of general interest to everyone  
reading the recommendation, it strikes me that it covers topics that  
should logically occur before chapter 3.  I think by moving this to  
chapter 2, moving the DOM to chapter 3 or later and then making the  
chapter on semantics come after this chapter, we could then take all  
of these issues as read.

My sense is that a consensus is building in the group to use a  
generalized XML-like syntax for the examples and illustrations in the  
semantics chapter. By providing this syntax and serialization  
material up front, a reader reading from the beginning will be  
totally up to speed.

Entity handling in XML serialization
--------------------------------

Since XML leaves open the issue of error-handling for character
entities, I propose that the HTML5 recommendation step up to provide
more guidance and norms on this issue. The major browsers mostly
deploy very draconian error-handling when parsing HTML as XML: more
so than the XML recommendation calls for. In this situation, I think
the HTML5 recommendation could step in as an XML application and more
thoroughly define how XHTML UAs should handle stray ampersand '&'
and '<' characters as well as unknown character entities.
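
To make the current behavior concrete, here is a rough sketch in Python.
It uses the standard library's expat-based parser as a stand-in for a
browser's XML parser, which is an assumption on my part: browsers ship
their own parsers, but the outcome (a fatal error and no document) is of
the same kind. The helper name `parses` is made up for this illustration.

```python
import xml.etree.ElementTree as ET

def parses(markup: str) -> bool:
    """Return True if the XML parser accepts markup, False on a fatal error."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

print(parses("<p>AT&amp;T</p>"))   # True: the ampersand is escaped
print(parses("<p>AT&T</p>"))       # False: stray '&'
print(parses("<p>1 < 2</p>"))      # False: stray '<'
print(parses("<p>a&nbsp;b</p>"))   # False: entity not declared (no DTD read)
```

Each of the three failing inputs is a single-character slip, yet the
whole document is rejected.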

For example, I think it might make sense for us to require HTML5 UAs
that de-serialize HTML5 from XML to process stray '&' and '<'
characters in the same way the text/html serialization does. That is,
if the entity is only missing a trailing semicolon in certain cases,
go ahead and treat it as that character entity. If it doesn't fall
into that case, treat the ampersand as &amp;.
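
As a minimal sketch of what that lenient rule looks like in practice,
Python's `html.unescape` can serve as a model. It follows the character
reference rules of the later HTML standard, which I am assuming are close
in spirit to the text/html behavior described here, not necessarily
identical to the 2007 draft.

```python
import html

# Entity missing only its trailing semicolon: still treated as the entity.
print(html.unescape("fish &amp chips"))   # fish & chips
print(html.unescape("&copy 2007"))        # © 2007

# No entity name follows the ampersand: it is kept as a literal '&'.
print(html.unescape("AT&T"))              # AT&T
```

An XHTML UA following the proposal would apply this kind of recovery to
character data instead of halting at the first stray ampersand.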

For the issue of unknown entities, where the character entity has the  
right syntax for a character entity, but is not defined by HTML, we  
could recommend those entities be mapped to the Unicode replacement  
character (U+FFFD).
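
A sketch of that mapping, under stated assumptions: the set of entities
"defined by HTML" is approximated by Python's `html.entities.html5`
table (the named references of the later HTML standard), and the helper
name `resolve_entities` is hypothetical. Only references with complete
syntax (ampersand, name, semicolon) are considered.

```python
import re
from html.entities import html5  # keys include the semicolon, e.g. "amp;"

REPLACEMENT = "\ufffd"  # Unicode replacement character
_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def resolve_entities(raw: str) -> str:
    """Expand known named entities; map unknown ones to U+FFFD."""
    def expand(match):
        # Look the name up with its semicolon; fall back to U+FFFD.
        return html5.get(match.group(1) + ";", REPLACEMENT)
    return _ENTITY.sub(expand, raw)

print(resolve_entities("a&nbsp;b") == "a\u00a0b")         # True
print(resolve_entities("x &bogus; y") == "x \ufffd y")    # True
```

The reader sees a visible marker where the unknown entity was, and the
rest of the page still renders.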

In my experience a large proportion of the XHTML errors I encounter
(where the browser refuses to display the page) are due to this
extra-draconian error handling. By specifying another more lenient and
XML-conforming behavior we will smooth over the use of XHTML. The
browsers may even adopt this error-handling for the other XHTMLs too.

Clearly, the XML recommendation left open this possibility for good
reasons[3]. The issues of ill-formedness are not really affected by
incorrectly used '&' or '<' characters (at least not directly) nor by
unknown character entities (especially when the UA isn't retrieving
the schema for the doctype).


[1]: <http://lists.w3.org/Archives/Public/public-html/2007Jul/1123.html> and
<http://lists.w3.org/Archives/Public/public-html/2007Jul/1125.html> and
<http://lists.w3.org/Archives/Public/public-html/2007Jul/1167.html>
[2]: <http://lists.w3.org/Archives/Public/public-html/2007Aug/0134.html>
[3]: I recently filed bugs on these XML processing issues with WebKit:
<http://bugs.webkit.org/show_bug.cgi?id=14952> and
<http://bugs.webkit.org/show_bug.cgi?id=14945>
Received on Tuesday, 14 August 2007 10:15:38 UTC