- From: Philip Taylor <philip@zaynar.demon.co.uk>
- Date: Sat, 28 Jul 2007 18:57:58 +0100
- To: public-html@w3.org
There was some discussion a while ago about a more formal definition of HTML5, and another a couple of months ago on IRC where I thought it'd be nice to have a model of the HTML5 parser in an easily processable form (instead of the English prose in the spec and the Python/Ruby/Java/PHP/etc. in existing implementations). I recently had some time to look into it and implemented that idea for the tokeniser algorithm, and the result seems to be promising - I think it is reasonably useful for implementing, visualising, testing and verifying the algorithm.

I translated the tokeniser description from the spec into a data structure written in OCaml, built around a set of simple operations ("is consumed character equal to EOF", "emit current tag token", etc). It could probably be considered an actual formal definition in a mathematical sense (or at least trivially mapped onto one), and it can act as a specification to be refined into executable code. (A rough sketch of the general shape of this representation appears below, after the list.)

From that translation, a number of things are made possible:

* Each of the operations has an implementation as an OCaml function, and the algorithm description controls how they are connected together, to create an (inefficient) OCaml tokeniser implementation (sketched below).

* I also implemented (manually) each operation as a C++ function, and can automatically convert the OCaml algorithm representation into a new piece of C++ code that runs the state machine and calls those functions, resulting in a reasonably efficient tokeniser implementation. (It's a few hundred times faster than the Python html5lib tokeniser, and maybe a couple of times faster than Henri's Java one, depending on how you measure it.) That process also does some simple optimisation of the code (e.g. merging partially-redundant 'if' statements that were introduced by some earlier abstractions, or recognising where conditions can be written with 'switch' instead). A schematic sketch of this kind of code generation appears below.

* I also did the same to create JavaScript and Perl implementations of the tokeniser - the JS one is usable at <http://canvex.lazyilluminati.com/misc/parser/tokeniser.html> (it almost works correctly, except for some minor Unicode issues) - and it should be straightforward to produce basic implementations in any other language in the same way. A single change to the original OCaml code will update all of the different language implementations at once (although some other types of update would still require some manual work in each language).

* It is fairly easy to do global transformations of the code: the data at <http://canvex.lazyilluminati.com/misc/stats/tokeniser.html> was generated by automatically inserting C++ code to count each step of the algorithm, and the parse error messages of the form "[unexpected] character [...] in [...]State" in <http://canvex.lazyilluminati.com/survey/2007-07-17/analyse.cgi/index#parse-errors> were added automatically where the specification simply says "Parse error."

* The data can be used to build various types of state transition diagram: <http://canvex.lazyilluminati.com/misc/tokeniser_states.png> shows all possible transitions through the standard state machine (modified to use a special data state after reading EOF, for clarity), with red for paths that are parse errors and blue for paths that occur only on EOF. (The DOT-extraction sketch below shows the basic idea.)

* Some types of test case can be generated automatically: I can run the OCaml implementation of the tokeniser on various inputs, and when it reaches the end of the input stream, the program will examine its current state and find 'interesting' next input characters (based on what the next step of the algorithm will be looking for), to build up a list of inputs that cover every transition between states, without an excessive amount of time spent on redundant inputs (sketched below). I built a list of about ten thousand cases from there, which found some minor bugs in other tokeniser implementations.

* Various properties of the algorithm can be automatically verified: I have a kind of data-flow type-analysis algorithm which ensures that e.g. if the "append the consumed character to the current tag token's current attribute's name" operation is executed then the current token is definitely a tag token which has at least one attribute; and that there are never two "reconsume the current input character" operations without a "consume the next input character" between them; and some other similar forms of 'correctness', which allow implementations to make optimisations safely (e.g. not bothering to check the current token's type). (A trivial example of this kind of check is sketched below.)

These should all be easily adaptable to small changes in the spec, and they should make it obvious if any spec change makes the properties no longer hold.
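To make the above more concrete, here are a few heavily simplified sketches. They are illustrative only - the type, constructor and state names are made up for this message, and the real code in the SVN repository linked below is structured rather differently. First, the general shape of the algorithm-as-data representation:

  (* Illustrative sketch only: each tokeniser state is a list of
     steps, and each step pairs a condition on the consumed
     character with a list of primitive actions. *)

  type cond =
    | CharIs of char   (* "is consumed character equal to 'x'" *)
    | IsEOF            (* "is consumed character equal to EOF" *)
    | Otherwise        (* the spec's "Anything else" entry *)

  type action =
    | EmitCurrentChar       (* emit the current input character *)
    | EmitEOFToken
    | ParseError
    | Reconsume             (* reconsume the current input character *)
    | SwitchTo of string    (* switch to the named state *)

  type state = {
    name  : string;
    steps : (cond * action list) list;  (* first matching step wins *)
  }

  type machine = state list

  (* A loose transcription of a fragment of the data state: *)
  let data_state = {
    name = "data";
    steps = [
      (CharIs '<', [ SwitchTo "tagOpen" ]);
      (IsEOF,      [ EmitEOFToken ]);
      (Otherwise,  [ EmitCurrentChar; SwitchTo "data" ]);
    ];
  }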
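Reusing those toy types, the "inefficient OCaml implementation" idea is then just a small interpreter: each primitive operation is an ordinary OCaml function, and the data structure decides how they are chained together. Something along these lines (again, only a sketch - reconsume handling and token construction are omitted):

  let find_state m n = List.find (fun s -> s.name = n) m

  let matches c cond =
    match c, cond with
    | Some x, CharIs y  -> x = y
    | None,   IsEOF     -> true
    | _,      Otherwise -> true
    | _, _              -> false

  (* Drive the machine over a string, consuming one character per
     step; None plays the role of EOF.
     E.g. run [ data_state ] "abc" prints "abc[EOF token]". *)
  let run (m : machine) (input : string) =
    let len = String.length input in
    let rec step st pos =
      let c = if pos < len then Some input.[pos] else None in
      let (_, actions) =
        List.find (fun (cond, _) -> matches c cond) st.steps in
      let next = ref st in
      List.iter
        (fun a -> match a with
           | EmitCurrentChar ->
               (match c with Some x -> print_char x | None -> ())
           | EmitEOFToken -> print_string "[EOF token]"
           | ParseError   -> prerr_endline ("parse error in " ^ st.name)
           | SwitchTo n   -> next := find_state m n
           | Reconsume    -> () (* omitted for brevity *))
        actions;
      if c <> None then step !next (pos + 1)
    in
    step (find_state m "data") 0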
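The C++/JavaScript/Perl generation is the same walk over the same data, except that instead of executing the operations it prints source code that calls manually written implementations of them. A schematic, JavaScript-flavoured version of the idea (the real generator does much more, e.g. the 'if'-merging and 'switch' optimisations mentioned above):

  let cond_to_js = function
    | CharIs c  -> Printf.sprintf "c === %C" c
    | IsEOF     -> "c === EOF"
    | Otherwise -> "true"

  (* Each primitive maps to a call to a manually written JS helper;
     emitChar/emitEOF/parseError are assumed names, not real APIs. *)
  let action_to_js = function
    | EmitCurrentChar -> "emitChar(c);"
    | EmitEOFToken    -> "emitEOF();"
    | ParseError      -> "parseError();"
    | SwitchTo n      -> Printf.sprintf "state = %S;" n
    | Reconsume       -> "reconsume = true;"

  let state_to_js st =
    let step (cond, actions) =
      Printf.sprintf "    if (%s) { %s break; }"
        (cond_to_js cond)
        (String.concat " " (List.map action_to_js actions)) in
    Printf.sprintf "  case %S:\n%s"
      st.name (String.concat "\n" (List.map step st.steps))

  let machine_to_js m =
    "switch (state) {\n"
    ^ String.concat "\n" (List.map state_to_js m)
    ^ "\n}"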
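The state transition diagrams fall out of the same data too: walking every step and printing a Graphviz edge for each SwitchTo gives the skeleton of a diagram like the one linked above.

  let to_dot (m : machine) =
    print_endline "digraph tokeniser {";
    List.iter
      (fun st ->
        List.iter
          (fun (_, actions) ->
            List.iter
              (function
                | SwitchTo target ->
                    Printf.printf "  %S -> %S;\n" st.name target
                | _ -> ())
              actions)
          st.steps)
      m;
    print_endline "}"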
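And the "interesting next input characters" used for test generation can be read straight off a state's conditions: every character that some step tests for explicitly, plus one stand-in for the "Anything else" path. Roughly:

  let interesting_chars (st : state) =
    let explicit =
      List.fold_left
        (fun acc (cond, _) ->
          match cond with CharIs c -> c :: acc | _ -> acc)
        [] st.steps in
    (* 'a' is an arbitrary stand-in for "any other character" *)
    'a' :: List.rev explicit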
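Finally, a deliberately trivial example of the flavour of the automatic checks: verifying that every SwitchTo names a state that actually exists in the machine. The real analyses - e.g. that the current token is a tag token wherever a tag-token operation runs - are data-flow analyses over the same structure, not this simple membership test.

  let check_switch_targets (m : machine) =
    let names = List.map (fun s -> s.name) m in
    List.iter
      (fun st ->
        List.iter
          (fun (_, actions) ->
            List.iter
              (function
                | SwitchTo n when not (List.mem n names) ->
                    Printf.eprintf
                      "undefined target state %S in %S\n" n st.name
                | _ -> ())
              actions)
          st.steps)
      m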
The code for all of this is currently at <http://canvex.lazyilluminati.com/svn/tokeniser/>, though I ought to move it somewhere proper if anyone is actually interested in it.

I'd like to see where else this kind of work could lead, and if anyone has ideas of useful directions. My current plan is to extend it to the tree construction part of the parser, which looks more complex but hopefully still feasible - I'm not sure whether I'll have time to work on it soon, but I believe it would be worth seeing if it can be made to work.

--
Philip Taylor
philip@zaynar.demon.co.uk