- From: James Graham <jg307@cam.ac.uk>
- Date: Sat, 05 May 2007 12:45:24 +0100
- To: Gareth Hay <gazhay@gmail.com>
- Cc: Maciej Stachowiak <mjs@apple.com>, Chris Wilson <Chris.Wilson@microsoft.com>, Jeff Schiller <codedread@gmail.com>, "matt@builtfromsource.com" <matt@builtfromsource.com>, "public-html@w3.org" <public-html@w3.org>
Gareth Hay wrote:
>> The spec describes what to do with every possible stream of input
>> characters.
>
> I think that is impossible.

It is demonstrably not impossible. Here is a rather useless spec which describes what to do with every possible stream of input characters: "For every input character: do nothing."

If you read the WHATWG parsing spec, you will see it is not written as a set of rules for correct content plus some rules on handling errors (the rules for correct content are written elsewhere). Instead, it is written as an explicit state machine describing how to deal with each character received by the parser. For example, the text for the data state reads:

"Data state
Consume the next input character:

U+0026 AMPERSAND (&)
    When the content model flag is set to one of the PCDATA or RCDATA states: switch to the entity data state. Otherwise: treat it as per the "anything else" entry below.

U+003C LESS-THAN SIGN (<)
    When the content model flag is set to a state other than the PLAINTEXT state: switch to the tag open state. Otherwise: treat it as per the "anything else" entry below.

EOF
    Emit an end-of-file token.

Anything else
    Emit the input character as a character token. Stay in the data state."

Clearly this handles all possible input to that state (there are further rules, written in a similar style, for how to handle the tokens emitted by the tokenizer to construct a DOM tree).

Handling all possible input is not the hard problem. The hard problems are doing the optimal thing with each possible input (that is, making the "best" possible DOM tree, where "best" is something that needs to be determined based on constraints like compatibility) and ensuring implementations agree. The first can be improved by testing the HTML5 parser against real-world content, and the second can be done by ensuring we have good test coverage of the parser spec and by testing implementations against each other in some automated way (e.g.
using an HTML fuzzer to feed different implementations the same malformed pages and checking the resulting DOM trees against each other).

I have started working a little on how html5lib [1] (a Python implementation of the HTML5 parser algorithm) works on real-world content. The code for what I have so far is available at [2] (patches welcome!) and is available online at [3], although I should stress that there are still significant known issues (including very poor behavior in cases where e.g. a URI cannot be accessed). However, what's really needed is an implementation in a JavaScript-supporting environment, so that complexities such as stream injection through document.write can be put to the test.

[1] http://code.google.com/p/html5lib/
[2] http://html5.googlecode.com/svn/trunk/parsetree-viewer/
[3] http://hasather.net/html5/parsetree/

-- 
"Instructions to follow very carefully. Go to Tesco's. Go to the coffee aisle. Look at the instant coffee. Notice that Kenco now comes in refil packs. Admire the tray on the shelf. It's exquiste corrugated boxiness. The way how it didn't get crushed on its long journey from the factory. Now pick up a refil bag. Admire the antioxidant claim. Gaze in awe at the environmental claims written on the back of the refil bag. Start stroking it gently, its my packaging precious, all mine.... Be thankful that Amy has only given you the highlights of the reasons why that bag is so brilliant." -- ajs
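[Editor's note: the data-state rules quoted earlier can be sketched as a single state-machine step in Python. This is a simplified illustration, not the spec's exact algorithm: states are plain strings, the content-model flag is a string, EOF is modelled as None, and the entity-data and tag-open states are only named, not implemented.]

```python
# Simplified sketch of one step of the HTML5 tokenizer's data state.
# Returns (action, next_state): action is None or an ("emit", token)
# pair; next_state is None once an end-of-file token has been emitted.

def data_state_step(char, content_model):
    """Process one input character while in the data state."""
    # U+0026 AMPERSAND (&) in PCDATA or RCDATA: go to entity data state.
    if char == "&" and content_model in ("PCDATA", "RCDATA"):
        return (None, "entity data state")
    # U+003C LESS-THAN SIGN (<) outside PLAINTEXT: go to tag open state.
    if char == "<" and content_model != "PLAINTEXT":
        return (None, "tag open state")
    # EOF (modelled here as None): emit an end-of-file token.
    if char is None:
        return (("emit", "end-of-file token"), None)
    # Anything else: emit the character as a character token, stay put.
    return (("emit", char), "data state")
```

Note how the "anything else" branch makes the function total: every character, in every content-model flag, has a defined transition, which is the point being made above.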
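[Editor's note: the fuzzing idea described above can be sketched with the standard library alone. Here Python's html.parser stands in for a real HTML5 engine, and the parser's event stream stands in for a DOM tree; a real harness would feed the same input to several implementations and diff the resulting trees. All names below (EventRecorder, random_soup, parse_events) are illustrative, not part of html5lib.]

```python
import random
from html.parser import HTMLParser

class EventRecorder(HTMLParser):
    """Record the parse events produced for a given input string."""
    def __init__(self):
        super().__init__()
        self.events = []
    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))
    def handle_endtag(self, tag):
        self.events.append(("end", tag))
    def handle_data(self, data):
        self.events.append(("data", data))

def random_soup(rng, length=40):
    """Generate malformed tag soup from a small, hostile alphabet."""
    alphabet = "<>/ab&; =\"'"
    return "".join(rng.choice(alphabet) for _ in range(length))

def parse_events(markup):
    """Parse markup and return the recorded event stream."""
    recorder = EventRecorder()
    recorder.feed(markup)
    recorder.close()
    return recorder.events

# A differential harness would compare these streams across engines;
# with one engine we can at least check it never raises on garbage.
rng = random.Random(0)
for _ in range(100):
    parse_events(random_soup(rng))
```

The same loop, pointed at two or more parsers with their outputs compared, is the automated cross-implementation test described above.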
Received on Saturday, 5 May 2007 11:47:18 UTC