- From: Philip Taylor <pjt47@cam.ac.uk>
- Date: Tue, 27 Jan 2009 10:51:58 +0000
- To: www-html@w3.org, Giovanni Campagna <scampa.giovanni@gmail.com>
Giovanni Campagna wrote:
>>> I asked a different question: why an author that doesn't rely on script
>>> (or an implementation that cannot, for various reason, implement
>>> scripts) should learn a plenty of DOM interfaces and APIs?
>>>
>>
>> DOM is the abstract model that serializations express. So if an
>> implementation is parsing a serialization, it's producing a DOM, regardless
>> of whether it supports scripting.
>
> What about SAX parsers? They don't build any DOM. An implementation is
> required to build an Infoset (abstract concept), not a DOM (a set of objects
> implementing certain interfaces)

HTML5 only requires that implementations act the same as if they were producing a DOM - it doesn't require that they actually do produce a DOM internally. (It specifically says "Note: Implementations that do not support scripting do not have to actually create a DOM Document object, but the DOM tree in such cases is still used as the model for the rest of the specification.")

You can write a streaming SAX parser for HTML5 without buffering anything into a tree, as long as you treat some errors as fatal (e.g. "<table>foo" is non-streamable because the text "foo" comes before the <table> in the parsed document). If you don't hit a non-streamable error, the output from the SAX parser has to be equivalent to what you'd get by parsing into a DOM and then emitting it as SAX, but there's no need to actually create a DOM.

The parser algorithm uses phrases like "Append a Comment node to the Document object with the data attribute set to the data given in the comment token.", which are fairly high-level (it's not saying e.g. "document.appendChild(document.createComment(token.data))") and easy to understand in terms of any tree model, and don't require a detailed knowledge of DOM. So the DOM is being used largely as an abstract concept and not as a set of objects.
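As a rough illustration of that point (this is not taken from the spec or from any real parser; every class and method name below is invented for the sketch, and the foster-parenting step is heavily simplified), the same high-level operations can drive either a tree builder or a streaming event emitter, and the streaming one just gives up on the non-streamable case:

```python
class NonStreamableError(Exception):
    """Raised when an operation would need to change already-emitted output."""


class TreeSink:
    """Builds an in-memory tree of plain dicts (a stand-in for a DOM)."""

    def __init__(self):
        self.document = {"type": "document", "children": []}
        self.open = [self.document]  # stack of currently open nodes

    def _append(self, node):
        self.open[-1]["children"].append(node)

    def comment(self, data):
        self._append({"type": "comment", "data": data})

    def start(self, name):
        node = {"type": "element", "name": name, "children": []}
        self._append(node)
        self.open.append(node)

    def end(self):
        self.open.pop()

    def text(self, data):
        self._append({"type": "text", "data": data})

    def foster_text(self, data):
        # Simplified foster parenting: place the text just before the
        # current open element (e.g. a <table>), which in this model is
        # always the last child of its parent.
        siblings = self.open[-2]["children"]
        siblings.insert(len(siblings) - 1, {"type": "text", "data": data})


class StreamSink:
    """Emits SAX-like events immediately and never buffers a tree."""

    def __init__(self):
        self.events = []
        self.names = []  # names of open elements, needed for end events

    def comment(self, data):
        self.events.append(("comment", data))

    def start(self, name):
        self.names.append(name)
        self.events.append(("start", name))

    def end(self):
        self.events.append(("end", self.names.pop()))

    def text(self, data):
        self.events.append(("text", data))

    def foster_text(self, data):
        # The start event for the current element is already out, so the
        # text cannot be moved in front of it: treat this as fatal.
        raise NonStreamableError(data)


def tree_to_events(node):
    """Walk a TreeSink tree and return the equivalent event list."""
    events = []
    for child in node["children"]:
        if child["type"] == "element":
            events.append(("start", child["name"]))
            events.extend(tree_to_events(child))
            events.append(("end", child["name"]))
        else:
            events.append((child["type"], child["data"]))
    return events


def replay(ops, sink):
    """Apply a list of (method_name, *args) operations to a sink."""
    for name, *args in ops:
        getattr(sink, name)(*args)
    return sink


# Streamable input: both routes give the same events.
ops = [("comment", " hi "), ("start", "p"), ("text", "foo"), ("end",)]
assert replay(ops, StreamSink()).events == tree_to_events(
    replay(ops, TreeSink()).document
)
```

Feeding the "<table>foo" case through as `[("start", "table"), ("foster_text", "foo"), ("end",)]` works with the hypothetical `TreeSink` (the text ends up before the table) but raises `NonStreamableError` from `StreamSink`, which is the "treat some errors as fatal" trade-off described above.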
Since scripting relies on the DOM, the spec has to define how to get a DOM from a serialised document (and how to handle e.g. scripts mutating the document while it's being parsed), and that's much easier if the parser's abstract model is the DOM instead of using some other model that has to be explicitly mapped onto the DOM implementation. As far as I'm aware, implementers of non-scripted parsers have not had any problems mapping the concepts onto different output formats (html5lib has several tree formats, Validator.nu has XOM and SAX, etc), so it seems to work fine in practice.

-- 
Philip Taylor
pjt47@cam.ac.uk
Received on Tuesday, 27 January 2009 10:52:34 UTC