- From: Philip Taylor <philip@zaynar.demon.co.uk>
- Date: Sat, 28 Jul 2007 18:57:58 +0100
- To: public-html@w3.org
There was some discussion a while ago about a more formal definition
of HTML5, and another a couple of months ago on IRC, where I thought
it would be nice to have a model of the HTML5 parser in an easily
processable form (instead of the English prose in the spec and the
Python/Ruby/Java/PHP/etc. in existing implementations). I recently
had some time to look into it and implemented that idea for the
tokeniser algorithm. The result seems promising: I think it is
reasonably useful for implementing, visualising, testing and
verifying the algorithm.
I translated the tokeniser description from the spec into a data
structure written in OCaml, built around a set of simple operations
("is the consumed character equal to EOF", "emit the current tag
token", etc). It could probably be considered an actual formal
definition in a mathematical sense (or at least trivially mapped onto
one), and it can act as a specification to be refined into executable
code.
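To give a flavour of the representation, here is a much-simplified
toy version. The real type has many more operations and details, and
all the constructor names below are invented for illustration:

  type state_name = string

  (* Conditions the algorithm can test. *)
  type condition =
    | CurrentCharIs of char
    | CurrentCharIsEOF

  (* The simple operations the algorithm is built from. *)
  type action =
    | ConsumeNextChar
    | ReconsumeCurrentChar
    | EmitCurrentChar
    | ParseError of string   (* "" until a message is filled in *)
    | SwitchTo of state_name

  (* A step is an operation or a conditional over further steps. *)
  type step =
    | Do of action
    | If of condition * step list * step list

  (* The whole tokeniser description: a named list of states. *)
  type machine = (state_name * step list) list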
From that translation, a number of things are made possible:
* Each of the operations has an implementation as an OCaml function,
and the algorithm description controls how they are connected
together, creating an (inefficient) OCaml tokeniser implementation
(the first sketch after this list gives the idea).
* I also implemented (manually) each operation as a C++ function, and
can automatically convert the OCaml algorithm representation into a
new piece of C++ code that runs the state machine and calls those
functions, resulting in a reasonably efficient tokeniser
implementation. (It's a few hundred times faster than the Python
html5lib tokeniser, and maybe a couple of times faster than Henri's
Java one, depending on how you measure it.) That process also does
some simple optimisation of the code (e.g. merging
partially-redundant 'if' statements that were introduced by some
earlier abstractions, or recognising where conditions can be written
with 'switch' instead); the second sketch after this list shows the
shape of the generator.
* I also did the same to create JavaScript and Perl implementations
of the tokeniser - the JS one is usable at
<http://canvex.lazyilluminati.com/misc/parser/tokeniser.html>, and
works correctly except for some minor Unicode issues - and it should
be straightforward to produce basic implementations in any other
language in the same way. A single change to the original OCaml code
updates all of the different language implementations at once
(although some other types of update would still require manual work
in each language).
* It is fairly easy to do global transformations of the code: the
data at <http://canvex.lazyilluminati.com/misc/stats/tokeniser.html>
was generated by automatically inserting C++ code to count each step
of the algorithm, and the parse error messages of the form
"[unexpected] character [...] in [...]State" in
<http://canvex.lazyilluminati.com/survey/2007-07-17/analyse.cgi/index#parse-errors>
were added automatically where the specification simply says "Parse
error" (the third sketch after this list shows a toy version of that
transformation).
* The data can be used to build various types of state transition
diagram: <http://canvex.lazyilluminati.com/misc/tokeniser_states.png>
shows all possible transitions through the standard state machine
(modified to use a special data state after reading EOF, for
clarity), with red for paths that are parse errors and blue for paths
that occur only on EOF (the fourth sketch below shows how the
transitions are extracted).
* Some types of test case can be generated automatically: I can run
the OCaml implementation of the tokeniser on various inputs, and when
it reaches the end of the input stream the program examines its
current state and finds 'interesting' next input characters, based on
what the next step of the algorithm will be looking for (the fifth
sketch below shows the idea). That builds up a list of inputs which
cover every transition between states, without spending an excessive
amount of time on redundant inputs. I built a list of about ten
thousand cases this way, which found some minor bugs in other
tokeniser implementations.
* Various properties of the algorithm can be automatically verified:
I have a kind of data-flow type-analysis algorithm which ensures
that, e.g., if the "append the consumed character to the current tag
token's current attribute's name" operation is executed then the
current token is definitely a tag token with at least one attribute;
that there are never two "reconsume the current input character"
operations without a "consume the next input character" between them
(the last sketch below is a toy version of this check); and some
other similar forms of 'correctness', which allow implementations to
make optimisations safely (e.g. not bothering to check the current
token's type). These should all be easily adaptable to small changes
in the spec, and they should make it obvious if any spec change
breaks one of the properties.
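To make those points concrete, here are some toy sketches, all built
on the simplified type above, and all illustrative rather than the
real code. First, the direct OCaml interpretation: each operation is
a small function, and the machine description decides how they are
chained (the record fields and helper names are my invention for this
sketch):

  type ctx = {
    mutable input : char list;       (* remaining input *)
    mutable current : char option;   (* None once we hit EOF *)
    mutable reconsume : bool;        (* reuse current char next time *)
    mutable out : char list;         (* emitted characters, reversed *)
  }

  let consume ctx =
    if ctx.reconsume then ctx.reconsume <- false
    else
      match ctx.input with
      | [] -> ctx.current <- None
      | c :: rest -> ctx.current <- Some c; ctx.input <- rest

  let eval_cond ctx = function
    | CurrentCharIs c -> ctx.current = Some c
    | CurrentCharIsEOF -> ctx.current = None

  (* Run a list of steps; return Some next_state on a transition. *)
  let rec exec_steps ctx = function
    | [] -> None
    | step :: rest ->
        (match exec_step ctx step with
         | Some _ as next -> next
         | None -> exec_steps ctx rest)

  and exec_step ctx = function
    | Do (SwitchTo next) -> Some next
    | Do ConsumeNextChar -> consume ctx; None
    | Do ReconsumeCurrentChar -> ctx.reconsume <- true; None
    | Do EmitCurrentChar ->
        (match ctx.current with
         | Some c -> ctx.out <- c :: ctx.out
         | None -> ());
        None
    | Do (ParseError _) -> None      (* a real version would record it *)
    | If (cond, yes, no) ->
        exec_steps ctx (if eval_cond ctx cond then yes else no)

  (* Keep running states until one finishes without a transition
     (a well-formed machine stops transitioning after EOF). *)
  let rec run (m : machine) ctx name =
    match exec_steps ctx (List.assoc name m) with
    | Some next -> run m ctx next
    | None -> List.rev ctx.out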
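Second, a rough sketch of the C++ generation: walk the same machine
value, but print C++ text instead of interpreting it. The emitted
code and the helper functions it calls (consume_next() and so on) are
hypothetical, and the real generator does considerably more,
including the optimisations mentioned above:

  let cpp_of_cond = function
    | CurrentCharIs c -> Printf.sprintf "cur == '%c'" c
    | CurrentCharIsEOF -> "at_eof"

  let rec emit_steps buf indent steps =
    List.iter (emit_step buf indent) steps

  and emit_step buf indent = function
    | Do (SwitchTo next) ->
        Buffer.add_string buf (indent ^ "state = " ^ next ^ "; break;\n")
    | Do ConsumeNextChar ->
        Buffer.add_string buf (indent ^ "consume_next();\n")
    | Do ReconsumeCurrentChar ->
        Buffer.add_string buf (indent ^ "reconsume = true;\n")
    | Do EmitCurrentChar ->
        Buffer.add_string buf (indent ^ "emit_current();\n")
    | Do (ParseError msg) ->
        Buffer.add_string buf (indent ^ "parse_error(\"" ^ msg ^ "\");\n")
    | If (cond, yes, no) ->
        Buffer.add_string buf (indent ^ "if (" ^ cpp_of_cond cond ^ ") {\n");
        emit_steps buf (indent ^ "  ") yes;
        Buffer.add_string buf (indent ^ "} else {\n");
        emit_steps buf (indent ^ "  ") no;
        Buffer.add_string buf (indent ^ "}\n")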
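Third, a toy version of the parse-error transformation: map over the
machine, replacing each bare parse error with a message that names
the state it occurs in:

  let rec describe_steps state steps =
    List.map (describe_step state) steps

  and describe_step state = function
    | Do (ParseError "") ->
        Do (ParseError ("unexpected character in " ^ state))
    | If (cond, yes, no) ->
        If (cond, describe_steps state yes, describe_steps state no)
    | step -> step

  let describe_errors (m : machine) : machine =
    List.map (fun (name, steps) -> (name, describe_steps name steps)) m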
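Fourth, extracting transitions for a diagram: collect every
(from, to) pair in the description and print Graphviz "dot" edges.
The real version also distinguishes parse-error and EOF-only paths
for the colouring, which is omitted here:

  let rec step_targets = function
    | Do (SwitchTo next) -> [next]
    | If (_, yes, no) ->
        List.concat (List.map step_targets (yes @ no))
    | _ -> []

  let print_dot (m : machine) =
    print_endline "digraph tokeniser {";
    List.iter
      (fun (name, steps) ->
        List.iter
          (fun next -> Printf.printf "  \"%s\" -> \"%s\";\n" name next)
          (List.concat (List.map step_targets steps)))
      m;
    print_endline "}"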
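Fifth, the "interesting next characters" idea from the test
generation: when the input runs out, collect the characters that the
current state's conditions mention, plus one character that none of
them mention, as candidate continuations:

  let rec interesting_chars = function
    | [] -> []
    | If (CurrentCharIs c, yes, no) :: rest ->
        (c :: interesting_chars yes)
        @ interesting_chars no @ interesting_chars rest
    | If (CurrentCharIsEOF, yes, no) :: rest ->
        interesting_chars yes @ interesting_chars no
        @ interesting_chars rest
    | Do _ :: rest -> interesting_chars rest

  let candidate_inputs (m : machine) state =
    (* 'x' stands in for "any character no condition mentions" *)
    'x' :: interesting_chars (List.assoc state m)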
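And finally, a toy version of one of the verified properties: no path
may perform two reconsume operations without a consume in between.
The real analysis follows transitions across states; this sketch only
checks within a single state's steps:

  (* 'pending' is true if a reconsume has happened since the last
     consume. *)
  let rec check_reconsume pending = function
    | [] -> true
    | Do ReconsumeCurrentChar :: _ when pending -> false
    | Do ReconsumeCurrentChar :: rest -> check_reconsume true rest
    | Do ConsumeNextChar :: rest -> check_reconsume false rest
    | If (_, yes, no) :: rest ->
        check_reconsume pending (yes @ rest)
        && check_reconsume pending (no @ rest)
    | Do _ :: rest -> check_reconsume pending rest

  let verify_reconsume (m : machine) =
    List.for_all (fun (_, steps) -> check_reconsume false steps) m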
The code for all of this is currently at
<http://canvex.lazyilluminati.com/svn/tokeniser/>, though I ought to
move it somewhere proper if anyone is actually interested in it.
I'd like to see where else this kind of work could lead, and whether
anyone has ideas of useful directions. My current plan is to extend
it to the tree construction part of the parser, which looks more
complex but is hopefully still feasible - I'm not sure whether I'll
have time to work on it soon, but I believe it would be worth seeing
whether it can be made to work.
--
Philip Taylor
philip@zaynar.demon.co.uk