[whatwg] Potentially avoidable tokeniser/treebuilder dependency from Øistein E. Andersen on 2009-09-22 (public-whatwg-archive@w3.org from September 2009)

From: Øistein E. Andersen <liszt@coq.no>
Date: Wed, 23 Sep 2009 00:01:04 +0100
Message-ID: <074D8588-D141-4B33-8A64-BAA3B259B7E6@coq.no>

As currently specified, the tokeniser is mostly, but not completely,  
independent of the treebuilder.

The major obstacle for an independent tokeniser seems to be that the  
content model flag is set to RCDATA, RAWTEXT or PLAINTEXT by the  
treebuilder and not by the tokeniser. In most cases, the new content  
model flag is entirely predictable from the start tag (and RCDATA/ 
RAWTEXT element names are known to the tokeniser already).  The only  
exceptions I have found so far concern start tags within <select> and  
<frameset>, which are dropped by the treebuilder and therefore do not  
cause the content model flag to change.  Even these cases could  
perhaps have been handled by the tokeniser without too much trouble  
(and without changing the spec) if it were not for the "in select in  
table" insertion mode, where a missing </select> end tag may be  
inferred depending on the stack of open elements.

It seems unfortunate to abandon the possibility of an independent  
tokeniser just to handle what appears to be a corner case of a corner  
case, viz. unclosed RCDATA/RAWTEXT elements inside an unclosed  
<select> element in a table.  The easiest solution would be to switch  
the content model flag upon seeing an RCDATA/RAWTEXT/PLAINTEXT start  
tag irrespective of insertion mode, i.e., also within <select> and  
<frameset>, which would allow the tokeniser to take care of this  
without added complexity.  Other solutions might be worth considering  
if this is found to be too incompatible with existing pages.  (I could  
have a look at the http://www.dotnetdotcom.org/ dataset if that  
would be of any use.)
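
To make the difference concrete, here is a minimal Python sketch contrasting the two behaviours. It is an illustration under stated assumptions, not the spec's algorithm: the element table is a small, non-exhaustive sample chosen for the example, the set of "dropping" insertion modes is simplified, and the names FLAG_FOR_START_TAG, DROPPING_MODES, flag_as_specified and flag_as_proposed are hypothetical.

```python
from enum import Enum


class ContentModel(Enum):
    PCDATA = "PCDATA"
    RCDATA = "RCDATA"
    RAWTEXT = "RAWTEXT"
    PLAINTEXT = "PLAINTEXT"


# Assumed, non-exhaustive sample of start tags that switch the flag.
FLAG_FOR_START_TAG = {
    "title": ContentModel.RCDATA,
    "textarea": ContentModel.RCDATA,
    "style": ContentModel.RAWTEXT,
    "xmp": ContentModel.RAWTEXT,
    "plaintext": ContentModel.PLAINTEXT,
}

# Simplified assumption: insertion modes in which the treebuilder drops
# these start tags, so that the flag is left unchanged.
DROPPING_MODES = {"in select", "in select in table", "in frameset"}


def flag_as_specified(tag_name, insertion_mode, current=ContentModel.PCDATA):
    """Flag after a start tag when the treebuilder decides (needs tree state)."""
    if insertion_mode in DROPPING_MODES:
        return current  # tag is dropped, flag does not change
    return FLAG_FOR_START_TAG.get(tag_name.lower(), current)


def flag_as_proposed(tag_name, current=ContentModel.PCDATA):
    """Flag after a start tag when the tokeniser decides from the name alone."""
    return FLAG_FOR_START_TAG.get(tag_name.lower(), current)


if __name__ == "__main__":
    # The two behaviours differ only inside <select> and <frameset>.
    print(flag_as_specified("textarea", "in select"))
    print(flag_as_proposed("textarea"))
```

The second function takes no insertion mode at all, which is the point: under the proposed change the tokeniser would need nothing from the treebuilder to pick the next content model.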

(A tiny bit of context: I recently implemented most of the tokeniser  
in lex with a view to using it as a tool to investigate the use of  
named character references in existing documents.  It uses about 20  
start conditions instead of the spec's 39 states and two flags, is  
fairly compact and readable (500 lines compared to 5,500 in the  
Validator.nu implementation), and runs about 35 times faster than the  
full Validator.nu HTML Parser (both under highly suboptimal  
conditions).  Unfortunately, it is of little use without a treebuilder  
to set the content model flag.  It has been pointed out that use cases  
for which a tree is not needed may not require perfect tokenisation;  
even if that be true, it is much more difficult to ensure that an  
approximate implementation is sufficiently close than to follow the  
specification; perhaps more importantly, removing unnecessary  
dependencies and allowing the tokeniser to run on its own would also  
make it easier to develop and test a tokeniser for use as part of a  
full parser.)

-- 
Øistein E. Andersen

Received on Tuesday, 22 September 2009 16:01:04 UTC