Tokeniser meta-implementation

There was some discussion a while ago about a more formal definition 
of HTML5, and another a couple of months ago on IRC, where I thought 
it'd be nice to have a model of the HTML5 parser in an easily 
processable form (instead of the English prose in the spec and the 
Python/Ruby/Java/PHP/etc. in existing implementations). I recently had 
some time to look into it and implemented that idea for the tokeniser 
algorithm, and the result seems promising - I think it is reasonably 
useful for implementing, visualising, testing and verifying the 
algorithm.


I translated the tokeniser description from the spec into a data 
structure written in OCaml, built around a set of simple operations 
("is consumed character equal to EOF", "emit current tag token", 
etc.). It could probably be considered an actual formal definition in 
a mathematical sense (or at least trivially mapped onto one), and it 
can act as a specification to be refined into executable code.
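
To give a flavour of what that looks like, here is a minimal sketch 
of the kind of data structure I mean (the constructor and state names 
here are illustrative, not the ones the real code uses):

  (* Toy version of the algorithm representation: a tokeniser is a
     set of named states, each a list of operations, some of which
     branch on simple conditions. *)
  type cond =
    | ConsumedCharIs of char
    | ConsumedCharIsEOF

  type op =
    | ConsumeChar                     (* "consume the next input character" *)
    | EmitCurrentTagToken             (* "emit the current tag token" *)
    | If of cond * op list * op list  (* branch on a condition *)
    | SwitchState of string           (* move to another tokeniser state *)

  type state = { name : string; body : op list }
  type tokeniser = state list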

From that translation, a number of things become possible:

* Each of the operations has an implementation as an OCaml function, 
and the algorithm description controls how they are connected 
together, to create an (inefficient) OCaml tokeniser implementation - 
roughly the shape of the interpreter sketched below.
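
A self-contained toy version of that interpreter idea (it only counts 
characters and emits labels, but the shape is the same; none of this 
is the real code):

  type op =
    | ConsumeChar
    | IfEOF of op list * op list   (* then-branch, else-branch *)
    | Emit of string

  type st = { mutable pos : int; input : string; mutable out : string list }

  (* The data structure decides which plain OCaml function runs next. *)
  let rec run st = function
    | ConsumeChar -> st.pos <- st.pos + 1
    | IfEOF (yes, no) ->
        let branch = if st.pos >= String.length st.input then yes else no in
        List.iter (run st) branch
    | Emit tok -> st.out <- tok :: st.out

  let () =
    let st = { pos = 0; input = "ab"; out = [] } in
    List.iter (run st) [ ConsumeChar; IfEOF ([ Emit "eof" ], [ Emit "char" ]) ];
    assert (st.out = [ "char" ])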

* I also implemented (manually) each operation as a C++ function, and 
can automatically convert the OCaml algorithm representation into a 
new piece of C++ code that runs the state machine and calls those 
functions, resulting in a reasonably efficient tokeniser 
implementation; the basic code-generation idea is sketched below. 
(It's a few hundred times faster than the Python html5lib tokeniser, 
and maybe a couple of times faster than Henri's Java one, depending 
on how you measure it.) That process also does some simple 
optimisation of the code (e.g. merging partially-redundant 'if' 
statements that were introduced by some earlier abstractions, or 
recognising where conditions can be written with 'switch' instead).
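
A minimal sketch of that code-generation idea, reusing the toy types 
from the previous sketch; consume_char(), at_eof() and emit() are 
stand-ins for the hand-written C++ operation functions:

  type op =
    | ConsumeChar
    | IfEOF of op list * op list
    | Emit of string

  (* Walk the same representation, but print C++ instead of running it. *)
  let rec gen buf = function
    | ConsumeChar -> Buffer.add_string buf "consume_char();\n"
    | IfEOF (yes, no) ->
        Buffer.add_string buf "if (at_eof()) {\n";
        List.iter (gen buf) yes;
        Buffer.add_string buf "} else {\n";
        List.iter (gen buf) no;
        Buffer.add_string buf "}\n"
    | Emit tok -> Buffer.add_string buf (Printf.sprintf "emit(%S);\n" tok)

  let () =
    let buf = Buffer.create 256 in
    List.iter (gen buf) [ ConsumeChar; IfEOF ([ Emit "eof" ], [ Emit "char" ]) ];
    print_string (Buffer.contents buf)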

* I also did the same to create JavaScript and Perl implementations 
of the tokeniser - the JS one is usable at 
<http://canvex.lazyilluminati.com/misc/parser/tokeniser.html>, and 
works correctly except for some minor Unicode issues - and it should 
be straightforward to produce basic implementations in any other 
language in the same way. A single change to the original OCaml code 
will update all of the different language implementations at once 
(although some other types of update would still require some manual 
work in each language).

* It is fairly easy to do global transformations of the code: the 
data at <http://canvex.lazyilluminati.com/misc/stats/tokeniser.html> 
was generated by automatically inserting C++ code to count each step 
of the algorithm (a sketch of this kind of rewrite is below), and the 
parse error messages of the form "[unexpected] character [...] in 
[...]State" in 
<http://canvex.lazyilluminati.com/survey/2007-07-17/analyse.cgi/index#parse-errors> 
were added automatically where the specification simply says "Parse error."
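
A rough sketch of such a rewrite, again on the toy representation - 
every operation in the tree gets a counting step put in front of it:

  type op =
    | ConsumeChar
    | IfEOF of op list * op list
    | Emit of string
    | Count of string   (* instrumentation inserted by the transformation *)

  let rec instrument op =
    match op with
    | IfEOF (yes, no) ->
        [ Count "if-eof";
          IfEOF (List.concat (List.map instrument yes),
                 List.concat (List.map instrument no)) ]
    | _ -> [ Count "step"; op ]

  let () =
    match instrument (Emit "x") with
    | [ Count _; Emit "x" ] -> print_endline "instrumented"
    | _ -> assert false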

* The data can be used to build various types of state transition 
diagram: <http://canvex.lazyilluminati.com/misc/tokeniser_states.png> 
shows all possible transitions through the standard state machine 
(modified to use a special data state after reading EOF, for clarity), 
with red for paths that are parse errors and blue for paths that occur 
only on EOF; a sketch of how such a diagram can be emitted is below.
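
A minimal sketch of emitting such a diagram in Graphviz dot syntax 
(only the parse-error colouring is shown, and the edge list is a 
stand-in for one extracted from the real representation):

  (* (from-state, to-state, is_parse_error) *)
  let edges =
    [ ("Data", "TagOpen", false);
      ("TagOpen", "Data", true) ]

  let () =
    print_endline "digraph tokeniser {";
    List.iter
      (fun (src, dst, err) ->
         Printf.printf "  \"%s\" -> \"%s\"%s;\n" src dst
           (if err then " [color=red]" else ""))
      edges;
    print_endline "}"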

* Some types of test case can be generated automatically: I can run 
the OCaml implementation of the tokeniser on various inputs, and when 
it reaches the end of the input stream the program will examine its 
current state and find 'interesting' next input characters, based on 
what the next step of the algorithm will be looking for (sketched 
below). That builds up a list of inputs covering every transition 
between states, without an excessive amount of time spent on 
redundant inputs. I built a list of about ten thousand cases this 
way, which found some minor bugs in other tokeniser implementations.
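
The core of the 'interesting characters' idea on a toy condition 
type: for whatever the next step might test, take one character that 
matches and one that doesn't, rather than trying every possible 
character:

  type cond =
    | CharIs of char        (* the next step tests for one character *)
    | CharIn of char list   (* ... or for membership of a set *)

  (* 'x' stands in for an arbitrary non-matching character. *)
  let interesting_chars = function
    | CharIs c -> [ c; 'x' ]
    | CharIn cs -> 'x' :: cs

  let () =
    assert (interesting_chars (CharIs '<') = [ '<'; 'x' ])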

* Various properties of the algorithm can be automatically verified: 
I have a kind of data-flow type-analysis algorithm which ensures 
that, e.g., if the "append the consumed character to the current tag 
token's current attribute's name" operation is executed then the 
current token is definitely a tag token with at least one attribute; 
that there are never two "reconsume the current input character" 
operations without a "consume the next input character" between them 
(a simplified check is sketched below); and some other similar forms 
of 'correctness'. These checks allow implementations to make 
optimisations safely (e.g. not bothering to check the current token's 
type); they should all be easily adaptable to small changes in the 
spec, and should make it obvious if any spec change makes the 
properties no longer hold.
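
A much-simplified sketch of the reconsume/consume property, checked 
over a straight-line list of operations (the real analysis has to 
follow branches and state changes too):

  type op = Consume | Reconsume | Other

  (* 'pending' records whether a Reconsume has happened with no
     Consume since; a second Reconsume while pending fails the check. *)
  let check ops =
    let ok, _ =
      List.fold_left
        (fun (ok, pending) op ->
           match op with
           | Consume -> (ok, false)
           | Reconsume -> (ok && not pending, true)
           | Other -> (ok, pending))
        (true, false) ops
    in
    ok

  let () =
    assert (check [ Reconsume; Consume; Reconsume ]);
    assert (not (check [ Reconsume; Other; Reconsume ]))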


The code for all of this is currently at 
<http://canvex.lazyilluminati.com/svn/tokeniser/>, though I ought to 
move it somewhere proper if anyone is actually interested in it.

I'd like to see where else this kind of work could lead, and whether 
anyone has ideas of useful directions. My current plan is to extend 
it to the tree construction part of the parser, which looks more 
complex but is hopefully still feasible - I'm not sure whether I'll 
have time to work on it soon, but I believe it would be worth seeing 
if it can be made to work.

-- 
Philip Taylor
philip@zaynar.demon.co.uk
