Re: Other syntax: part of my review of 8 The HTML syntax from Robert Burns on 2007-08-16 (public-html@w3.org from August 2007)

From: Robert Burns <rob@robburns.com>
Date: Wed, 15 Aug 2007 22:52:12 -0500
To: Robert Burns <rob@robburns.com>
Cc: HTMLWG WG <public-html@w3.org>
Message-Id: <BF64D7BF-9186-4CF1-B118-C57E81D754DF@robburns.com>
On Aug 14, 2007, at 5:14 AM, Robert Burns wrote:

>
>
> Summary
> --------------------------------
>  • Rework the HTML syntax to be about HTML in the broadest sense of  
> the term
>  • Use HTML to refer to the broadest sense of HTML and 'text/html'  
> to refer to the serialization
>  • Move the syntax chapter to chapter two (2).
>  • Specify error-handling for the XML serialization with respect to  
> misused '&', '<' and unknown character entities

On half of this fourth bullet-point, some further research revealed I  
was wrong (thanks in part to Maciej for pointing that out, though in  
the usual obscure way :-). While stray '&' and '<' are not  
fatal errors on their own in XML, they inherently lead to  
well-formedness constraint violations and therefore eventually lead to  
fatal errors. While this creates some wiggle room that an  
implementation could exploit without much consequence (since character  
references do not create trees, so the negative impact is quite  
localized), it would be unwise for an HTML recommendation to  
encourage that sort of thing.
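
As a rough illustration of the strict behavior described above, here is a minimal Python sketch using the standard library's expat-backed parser. The helper name is_well_formed and the sample strings are made up for this example; other XML processors may report the failure differently.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the expat-backed parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample fragments: one properly escaped, two with stray characters.
samples = {
    "escaped": "<p>fish &amp; chips, 1 &lt; 2</p>",
    "stray ampersand": "<p>fish & chips</p>",
    "stray less-than": "<p>1 < 2</p>",
}

for label, markup in samples.items():
    print(label, "->", "accepted" if is_well_formed(markup) else "fatal error")
```

Only the escaped fragment is accepted; the parser stops at the first stray character in the other two.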

The fourth bullet-point should probably read:
  • Specify error-handling for the XML serialization with respect to  
unknown character entities

The other points remain viable. In particular, we could specify that  
HTML5 documents processed as XML should not throw up error pages when  
encountering an unknown character reference (like &madeupreference;).  
The current trend among implementations is to treat that as a fatal  
error, which needlessly breaks many web pages. If we could  
address that it would be a big deal.
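
To make the unknown-reference case concrete, here is a minimal Python sketch of the lenient handling, using the U+FFFD mapping suggested in the original message. It is an illustration under stated assumptions, not anything a spec or browser defines: the soften_entities helper and the ENTITY_REF pattern are hypothetical, the set of "known" names is taken from Python's html.entities.name2codepoint (the HTML 4 names), and the rewrite is a naive text pass that would also touch CDATA sections and comments.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities every XML processor already knows.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Assumption: a named reference is '&', a name, then ';'.
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def soften_entities(markup: str) -> str:
    """Rewrite named references so a strict XML parser will accept them."""

    def replace(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name in XML_PREDEFINED:
            return match.group(0)  # leave &amp;, &lt;, ... untouched
        if name in name2codepoint:
            # Known HTML name: use a numeric reference, which needs no DTD.
            return "&#%d;" % name2codepoint[name]
        return "\ufffd"  # unknown name: Unicode replacement character

    return ENTITY_REF.sub(replace, markup)


source = "<p>caf&eacute; &madeupreference;</p>"

try:
    ET.fromstring(source)
except ET.ParseError as err:
    print("strict parse failed:", err)

paragraph = ET.fromstring(soften_entities(source))
print(paragraph.text)
```

The strict parse fails on the first named reference it cannot resolve, while the softened input parses and the made-up reference shows up as a single replacement character.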


Take care,
Rob

remainder of original message
-----------------------------------------
> Broadening this chapter
> --------------------------------
> Echoing and adapting a suggestion made by Ben Boyle [1], I'd like  
> to see this chapter broadened to include not just the syntax for  
> the text/html serialization, but HTML syntax in general. This could  
> look something like:
>
> 8. The HTML syntax
> 8.1 Expressing HTML data types (for expressing booleans, numbers,  
> dates and times: especially where they are independent of syntax)
> 8.2 XML based HTML syntax
> 8.3 text/html based HTML syntax
> 8.3.1. Writing HTML documents
> 8.3.2. Parsing HTML documents in the text/html serialization
> 8.3.3. Namespaces in the text/html serialization (this might be  
> generalizable too depending on how we handle the distributed  
> extensibility issue[2])
> 8.4. Entities (which are a part of both serializations)
>
> The order could be changed if we wanted to make the text/html  
> serialization first; followed by the xml serialization; followed by  
> the serialization independent syntax.
>
> Use HTML in the broad sense: 'text/html' to name the serialization
> --------------------------------
> Make the term HTML throughout the draft apply to HTML in the broad  
> sense. Use 'text/html' or another term to refer to the particular  
> serialization of HTML. Use HTML DOM or just DOM in the draft to  
> refer specifically to the DOM. Use XHTML or XML to refer to the XML  
> serialization of HTML. HTML is so much more about that big picture  
> than it is about any particular serialization of it. We could make  
> a binary serialization of HTML and it would still be HTML (IMHO).
>
> Moving this chapter earlier
> --------------------------------
> While this may not be a chapter of general interest to everyone  
> reading the recommendation, it strikes me that it covers topics  
> that should logically occur before chapter 3.  I think by moving  
> this to chapter 2, moving the DOM to chapter 3 or later and then  
> making the chapter on semantics come after this chapter, we could  
> then take all of these issues as read.
>
> My sense is that a consensus is building in the group to use a  
> generalized XML-like syntax for the examples and illustrations in  
> the semantics chapter. By providing this syntax and serialization  
> material up front, a reader reading from the beginning will be  
> totally up to speed.
>
> Entity handling in XML serialization
> --------------------------------
>
> Since XML leaves open the issue of error-handling for character  
> entities, I propose that the HTML5 recommendation step up to  
> provide more guidance and norms on this issue. The major browsers  
> mostly deploy a very draconian error-handling when parsing HTML as  
> XML: more so than the XML recommendation calls for. In this  
> situation, I think the HTML5 recommendation could step in as an XML  
> application and more thoroughly define how XHTML UAs should handle  
> stray ampersand '&' and  '<' characters as well as unknown  
> character entities.
>
> For example, I think it might make sense for us to require HTML5  
> UAs that process HTML5 de-serialized from XML to handle '&' and '<'  
> characters the same way the text/html serialization handles  
> character entities. That is, if the entity is only missing a  
> trailing semi-colon in certain cases, go ahead and treat it as that  
> character entity. If it doesn't fall into that case, treat the  
> ampersand as a literal '&'.
>
> For the issue of unknown entities, where the character entity has  
> the right syntax for a character entity, but is not defined by  
> HTML, we could recommend those entities be mapped to the Unicode  
> replacement character (U+FFFD).
>
> In my experience a large proportion of the XHTML errors I encounter  
> (where the browser refuses to display the page) are due to this  
> extra-draconian error handling. By specifying another more lenient  
> and XML conforming behavior we will smooth over the use of XHTML.  
> The browsers may even adopt this error-handling for the other  
> XHTMLs too.
>
> Clearly, the XML recommendation left open this possibility for good  
> reasons[3]. The issues of ill-formedness are not really affected by  
> incorrectly used '&' or '<' characters (at least not directly) nor  
> by unknown character entities (especially when the UA isn't  
> retrieving the schema for the doctype).
>
>
> [1]: <http://lists.w3.org/Archives/Public/public-html/2007Jul/1123.html> and
> <http://lists.w3.org/Archives/Public/public-html/2007Jul/1125.html> and
> <http://lists.w3.org/Archives/Public/public-html/2007Jul/1167.html>
> [2]: <http://lists.w3.org/Archives/Public/public-html/2007Aug/0134.html>
> [3]: I recently filed bugs on these XML processing issues with WebKit:
> <http://bugs.webkit.org/show_bug.cgi?id=14952> and
> <http://bugs.webkit.org/show_bug.cgi?id=14945>
Received on Thursday, 16 August 2007 03:52:18 UTC