Re: An HTML language specification

Mark Baker wrote:
> [...] The parser and much of the language is
> defined in DOM terms.  I haven't had a detailed enough look at the
> parser to know if the DOM gets in the way though, or if it can simply
> be used as an abstract model as the spec says ("Implementations that
> do not support scripting do not have to actually create a DOM Document
> object, but the DOM tree in such cases is still used as the model for
> the rest of the specification.").  As somebody pointed out, html5lib
> doesn't have a DOM, so that's an argument that it's possible.  But I'm
> still wary of using an implemented model as an abstract one, lest
> nuances of the various implementations result in differing
> interpretations of the specification.

The parsing algorithm says:

   "The tree construction stage is associated with a DOM Document object 
when a parser is created. The "output" of this stage consists of 
dynamically modifying or extending that document's DOM tree."

and then defines terms like "create an element for a token" in terms of 
DOM concepts like the HTMLAnchorElement interface. Then it uses phrases 
like:

   "Append a Comment node to the current node with the 'data' attribute 
set to the data given in the comment token."

and

   "Insert 'last node' into 'node', first removing it from its previous 
parent node if any."

So it seems to me that the spec is already using quite an abstract view of 
the DOM. It uses the DOM interface names to identify the different types 
of node that can be generated, and to refer to the fields of each node 
(like 'data'), but otherwise it uses generic tree terminology. In 
particular it doesn't say anything like "Execute 
node.appendChild(lastNode)", which would be much more 
DOM-implementation-specific.
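
As a rough illustration (my own sketch, not text from the spec), the two 
quoted steps can be carried out on a plain tree with no DOM objects at 
all. The Node class and the function names here are hypothetical, chosen 
only for this example:

```python
class Node:
    """A plain tree node; deliberately not a DOM object (hypothetical)."""

    def __init__(self, kind, name=None, data=None):
        self.kind = kind        # "element" or "comment"
        self.name = name        # tag name, for elements
        self.data = data        # text, for comments (the 'data' field)
        self.parent = None
        self.children = []


def append_comment(current_node, comment_data):
    # "Append a Comment node to the current node with the 'data'
    # attribute set to the data given in the comment token."
    comment = Node("comment", data=comment_data)
    comment.parent = current_node
    current_node.children.append(comment)
    return comment


def insert_node(last_node, node):
    # "Insert 'last node' into 'node', first removing it from its
    # previous parent node if any."
    if last_node.parent is not None:
        last_node.parent.children.remove(last_node)
    last_node.parent = node
    node.children.append(last_node)


body = Node("element", name="body")
p = Node("element", name="p")
b = Node("element", name="b")

insert_node(b, body)        # b starts out under body
insert_node(b, p)           # moving it detaches it from body first
append_comment(p, "hello")
```

Nothing in this depends on a DOM implementation; the only things borrowed 
from the DOM are the node types and the 'data' field name.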

People who have implemented the parsing algorithm using a variety of 
non-DOM output structures (ElementTree and BeautifulSoup in html5lib, 
XOM and SAX in validator.nu, some purely functional tree structure in my 
OCaml implementation, etc.) have never (as far as I'm aware) expressed 
concerns that the spec makes it unnecessarily difficult to use a non-DOM 
output format. (There are some necessary difficulties when the output 
format can't represent all HTML documents, e.g. if it requires 
XML-compatible element names or unbuffered streaming, but those issues 
will occur regardless of how the spec is written.)

Does this increase or assuage your wariness at all?

-- 
Philip Taylor
pjt47@cam.ac.uk

Received on Thursday, 20 November 2008 23:53:01 UTC