Re: Support Existing Content from James Graham on 2007-05-05 (public-html@w3.org from May 2007)

From: James Graham <jg307@cam.ac.uk>
Date: Sat, 05 May 2007 12:45:24 +0100
To: Gareth Hay <gazhay@gmail.com>
Cc: Maciej Stachowiak <mjs@apple.com>, Chris Wilson <Chris.Wilson@microsoft.com>, Jeff Schiller <codedread@gmail.com>, "matt@builtfromsource.com" <matt@builtfromsource.com>, "public-html@w3.org" <public-html@w3.org>
Message-ID: <463C6E54.1040300@cam.ac.uk>

Gareth Hay wrote:

>> The spec describes what to do with every possible stream of input 
>> characters.
> 
> I think that is impossible.

It is demonstrably not impossible. Here is a rather useless spec which 
describes what to do with every possible stream of input characters:

"For every input character: do nothing"

If you read the WHATWG parsing spec, you will see it is not written as a 
set of rules for correct content plus some rules on handling errors (the 
rules on correct content are written elsewhere). Instead it is written 
as an explicit state machine describing how to deal with each character 
received by the parser. For example, the text for the data state reads:

"Data state

     Consume the next input character:

     U+0026 AMPERSAND (&)
         When the content model flag is set to one of the PCDATA or
         RCDATA states: switch to the entity data state.
         Otherwise: treat it as per the "anything else" entry below.
     U+003C LESS-THAN SIGN (<)
         When the content model flag is set to a state other than the
         PLAINTEXT state: switch to the tag open state.
         Otherwise: treat it as per the "anything else" entry below.
     EOF
         Emit an end-of-file token.
     Anything else
         Emit the input character as a character token. Stay in the
         data state."

Clearly this handles all possible input to that state (there are further 
rules written in a similar style for how to handle the tokens emitted by 
the tokenizer to construct a DOM tree).
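
To make that concrete, here is a rough Python sketch of the data state 
quoted above. It is only an illustration, not the spec text and not 
html5lib's code: the names ContentModel, EOF and data_state, the tuple 
layout of the tokens, and the inclusion of CDATA as a fourth flag value 
are all assumptions of the sketch.

```python
from enum import Enum, auto


class ContentModel(Enum):
    # Values of the content model flag. PCDATA, RCDATA and PLAINTEXT are
    # named in the quoted text; CDATA is assumed here for completeness.
    PCDATA = auto()
    RCDATA = auto()
    CDATA = auto()
    PLAINTEXT = auto()


EOF = None  # hypothetical sentinel meaning "no more input"


def data_state(char, content_model):
    """Handle one input character in the data state.

    Returns (token, next_state). token is None when nothing is emitted;
    next_state is None once the end-of-file token has been emitted.
    """
    if char is EOF:
        return ("EOF",), None
    if char == "&" and content_model in (ContentModel.PCDATA, ContentModel.RCDATA):
        return None, "entity data"
    if char == "<" and content_model is not ContentModel.PLAINTEXT:
        return None, "tag open"
    # "Anything else", which also covers & and < when the flag excludes them.
    return ("Character", char), "data"
```

The final return is what makes the function total: whatever character 
arrives, one of the branches applies, so there is no input for which 
the behaviour is left undefined.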

Handling all possible input is not the hard problem. The hard problems 
are doing the optimal thing with each possible input (that is, making 
the 'best' possible DOM tree where 'best' is something that needs to be 
determined based on constraints like compatibility) and ensuring 
implementations agree. The first can be improved by testing the HTML5 
parser against real-world content, and the second can be done by 
ensuring we have good test coverage of the parser spec and testing 
implementations against each other in some automated way (e.g. using an 
HTML fuzzer to feed different implementations the same malformed pages 
and checking the resulting DOM trees against each other).
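
A minimal sketch of what such a comparison harness might look like in 
Python follows. Everything in it is hypothetical: random_markup and 
find_disagreements are names made up for the sketch, and each "parser" 
is assumed to be a callable that takes a string and returns some 
comparable serialisation of the tree it built. A real fuzzer would 
generate far more structured input than random characters.

```python
import random


def random_markup(rng, length=40):
    # Draw characters from a small alphabet that is rich in markup
    # punctuation, so that malformed tags and entities are likely.
    alphabet = "<>/&;=\"' abip"
    return "".join(rng.choice(alphabet) for _ in range(length))


def find_disagreements(parsers, rounds=100, seed=0):
    """Feed the same random pages to every parser and collect mismatches.

    parsers maps an implementation name to a callable str -> str, where
    the returned string is a serialisation of the resulting tree.
    Returns a list of (page, {name: serialised_tree}) pairs.
    """
    rng = random.Random(seed)  # fixed seed so failures can be reproduced
    disagreements = []
    for _ in range(rounds):
        page = random_markup(rng)
        results = {name: parse(page) for name, parse in parsers.items()}
        if len(set(results.values())) > 1:
            disagreements.append((page, results))
    return disagreements
```

Each reported page is then a candidate test case: either one of the 
implementations has a bug or the spec is ambiguous at that point.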

I have started looking a little at how html5lib[1] (a Python 
implementation of the HTML5 parser algorithm) copes with real-world 
content. The code for what I have so far is at [2] (patches welcome!) 
and it can be used online at [3], although I should stress that there 
are still significant known issues (including very poor behavior in 
cases where e.g. a URI cannot be accessed). However, what's really 
needed is an implementation in a JavaScript-supporting environment so 
that complexities such as stream injection through document.write can be 
put to the test.

[1] http://code.google.com/p/html5lib/
[2] http://html5.googlecode.com/svn/trunk/parsetree-viewer/
[3] http://hasather.net/html5/parsetree/
-- 
"Instructions to follow very carefully.
Go to Tesco's.  Go to the coffee aisle.  Look at the instant coffee. 
Notice that Kenco now comes in refil packs.  Admire the tray on the 
shelf.  It's exquiste corrugated boxiness. The way how it didn't get 
crushed on its long journey from the factory. Now pick up a refil bag. 
Admire the antioxidant claim.  Gaze in awe at the environmental claims 
written on the back of the refil bag.  Start stroking it gently, its my 
packaging precious, all mine....  Be thankful that Amy has only given 
you the highlights of the reasons why that bag is so brilliant."
-- ajs

Received on Saturday, 5 May 2007 11:47:18 UTC