- From: Robert Burns <rob@robburns.com>
- Date: Tue, 14 Aug 2007 05:14:56 -0500
- To: HTMLWG WG <public-html@w3.org>
Summary
--------------------------------
• Rework the HTML syntax chapter to be about HTML in the broadest sense of the term
• Use HTML to refer to the broadest sense of HTML and 'text/html' to refer to the serialization
• Move the syntax chapter to chapter two (2)
• Specify error-handling for the XML serialization with respect to misused '&', '<' and unknown character entities

Broadening this chapter
--------------------------------
Echoing and adapting a suggestion made by Ben Boyle [1], I'd like to see this chapter broadened to include not just the syntax for the text/html serialization, but HTML syntax in general. This could look something like:

8. The HTML syntax
  8.1. Expressing HTML data types (for expressing booleans, numbers, dates and times: especially where they are independent of syntax)
  8.2. XML based HTML syntax
  8.3. text/html based HTML syntax
    8.3.1. Writing HTML documents
    8.3.2. Parsing HTML documents in the text/html serialization
    8.3.3. Namespaces in the text/html serialization (this might be generalizable too, depending on how we handle the distributed extensibility issue [2])
  8.4. Entities (which are a part of both serializations)

The order could be changed if we wanted to put the text/html serialization first, followed by the XML serialization, followed by the serialization-independent syntax.

Use HTML in the broad sense: 'text/html' to name the serialization
--------------------------------
Make the term HTML throughout the draft apply to HTML in the broad sense. Use 'text/html' or another term to refer to the particular serialization of HTML. Use HTML DOM or just DOM in the draft to refer specifically to the DOM. Use XHTML or XML to refer to the XML serialization of HTML. HTML is much more about that big picture than it is about any particular serialization of it. We could make a binary serialization of HTML and it would still be HTML (IMHO).
Moving this chapter earlier
--------------------------------
While this may not be a chapter of general interest to everyone reading the recommendation, it strikes me that it covers topics that should logically occur before chapter 3. If we move this chapter to chapter 2, move the DOM to chapter 3 or later, and make the chapter on semantics come after this chapter, we could then take all of these issues as read. My sense is that a consensus is building in the group to use a generalized XML-like syntax for the examples and illustrations in the semantics chapter. With this syntax and serialization material provided up front, someone reading from the beginning will be fully up to speed.

Entity handling in XML serialization
--------------------------------
Since XML leaves open the issue of error-handling for character entities, I propose that the HTML5 recommendation step up to provide more guidance and norms on this issue. Most major browsers apply very draconian error-handling when parsing HTML as XML, more so than the XML recommendation calls for. In this situation, I think the HTML5 recommendation could step in as an XML application and more thoroughly define how XHTML UAs should handle stray ampersand '&' and '<' characters as well as unknown character entities.

For example, I think it might make sense for us to require HTML5 UAs that process HTML5 de-serialized from XML to treat '&' and '<' characters in the same way as they are treated in the text/html serialization. That is, if the entity is only missing a trailing semicolon in certain cases, go ahead and treat it as that character entity. If it doesn't fall into that case, treat the ampersand as &amp;. For the issue of unknown entities, where the character entity has the right syntax for a character entity but is not defined by HTML, we could recommend that those entities be mapped to the Unicode replacement character (U+FFFD).
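As a rough illustration of the lenient handling proposed above, here is a minimal Python sketch. It is not taken from any draft or browser: the function name, the tiny entity table, and the set of names allowed to drop their semicolon are all assumptions made for the example. It only covers named references introduced by '&' in a run of character data; stray '<' characters are not handled, and numeric references are passed through untouched.

```python
import re

# Hypothetical, deliberately tiny entity table; a real UA would use the full HTML set.
KNOWN_ENTITIES = {"amp": "&", "lt": "<", "gt": ">", "quot": '"', "nbsp": "\u00a0"}

# Assumption: the "certain cases" in which a missing trailing semicolon is tolerated.
SEMICOLON_OPTIONAL = {"amp", "lt", "gt", "quot"}

REPLACEMENT_CHAR = "\ufffd"  # U+FFFD

# An ampersand (not starting a numeric reference), optionally followed by
# a name, which is in turn optionally followed by a semicolon.
_REFERENCE = re.compile(r"&(?!#)(?:([A-Za-z][A-Za-z0-9]*)(;?))?")


def resolve_references(text: str) -> str:
    """Resolve named references leniently instead of failing on the first error."""

    def replace(match: re.Match) -> str:
        name, semicolon = match.group(1), match.group(2)
        if name is None:
            # Stray ampersand: treat it as a literal '&' (as if written &amp;).
            return "&"
        if semicolon:
            # Well-formed reference: known names resolve, unknown names map to U+FFFD.
            return KNOWN_ENTITIES.get(name, REPLACEMENT_CHAR)
        if name in SEMICOLON_OPTIONAL:
            # Only the trailing semicolon is missing: accept the entity anyway.
            return KNOWN_ENTITIES[name]
        # Not a reference at all: keep the ampersand and the following text literally.
        return "&" + name

    return _REFERENCE.sub(replace, text)


print(resolve_references("AT&T &amp rock &bogus; roll"))
```

With these assumptions, "AT&T" is left alone, "&amp" without its semicolon becomes "&", and the syntactically valid but unknown "&bogus;" becomes U+FFFD, where a strict XML parser would stop with a fatal error.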
In my experience a large proportion of the XHTML errors I encounter (where the browser refuses to display the page) are due to this extra-draconian error handling. By specifying a more lenient, yet still XML-conforming, behavior we would smooth the way for the use of XHTML. The browsers may even adopt this error-handling for the other XHTMLs too. Clearly, the XML recommendation left open this possibility for good reasons [3]. Issues of ill-formedness are not really affected by incorrectly used '&' or '<' characters (at least not directly), nor by unknown character entities (especially when the UA isn't retrieving the schema for the doctype).

[1]: <http://lists.w3.org/Archives/Public/public-html/2007Jul/1123.html> and <http://lists.w3.org/Archives/Public/public-html/2007Jul/1125.html> and <http://lists.w3.org/Archives/Public/public-html/2007Jul/1167.html>
[2]: <http://lists.w3.org/Archives/Public/public-html/2007Aug/0134.html>
[3]: I recently filed bugs on these XML processing issues with WebKit: <http://bugs.webkit.org/show_bug.cgi?id=14952> and <http://bugs.webkit.org/show_bug.cgi?id=14945>
Received on Tuesday, 14 August 2007 10:15:38 UTC