Re: [whatwg] Tokenizor PseudoCode from Ian Hickson on 2013-07-01 (public-whatwg-archive@w3.org from July 2013)

From: Ian Hickson <ian@hixie.ch>
Date: Mon, 1 Jul 2013 23:03:38 +0000 (UTC)
To: "Mohammad Al Houssami (Alumni)" <mha53@mail.aub.edu>, Bjoern Hoehrmann <derhoermi@gmx.net>
Cc: "whatwg@lists.whatwg.org" <whatwg@lists.whatwg.org>
Message-ID: <Pine.LNX.4.64.1307012258140.11139@ps20323.dreamhostps.com>

On Fri, 15 Mar 2013, Mohammad Al Houssami (Alumni) wrote:
> 
> I just want to make sure that in places where no state change is called 
> it means we stay in the same state right? Take the RCDATA state below. 
> In the anything else branch we emit character token and then go consume 
> another character and check all the cases in this state. This is the 
> only thing that makes sense but I just want to make sure :)

On Sat, 16 Mar 2013, Bjoern Hoehrmann wrote:
> 
> You missed "When a token is emitted, it must immediately be handled by 
> the tree construction stage. The tree construction stage can affect the 
> state of the tokenization stage ..." but if that does not result in a 
> change of state either, then yes, as far as I am aware.

On Fri, 15 Mar 2013, Mohammad Al Houssami (Alumni) wrote:
>
> I'm trying to build an HTML5 Parser in Smalltalk and as a first step I'm 
> implementing the tokenizer and everything happens there. I think this is 
> the case only when we have scripts that add characters to the HTML 
> document which is out of the scope of the project I am working on at the 
> moment. Is this true or not ?

On Sat, 16 Mar 2013, Bjoern Hoehrmann wrote:
> 
> No. Grepping for "PLAINTEXT" should make this clear.

There are a number of places in the tree construction stage that change 
the tokenizer state, in particular the parsing of these elements: title, 
noscript, noframes, style, xmp, iframe, noembed, script, plaintext, 
textarea.
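
As a rough illustration of the list above, here is a minimal sketch of how 
a standalone tokenizer might emulate those switches when no tree builder is 
attached. It is written in Python rather than Smalltalk for brevity. The 
names STATE_AFTER_START_TAG and next_tokenizer_state are hypothetical, and 
the sketch is simplified: it ignores insertion modes and foreign (SVG and 
MathML) content, both of which the real tree construction rules take into 
account.

```python
# Tokenizer states relevant here, named as in the HTML parsing spec.
DATA = "data"
RCDATA = "RCDATA"
RAWTEXT = "RAWTEXT"
SCRIPT_DATA = "script data"
PLAINTEXT = "PLAINTEXT"

# Hypothetical lookup table: the state the tokenizer is switched to after
# the tree builder handles a start tag for each of these elements.
STATE_AFTER_START_TAG = {
    "title": RCDATA,
    "textarea": RCDATA,
    "style": RAWTEXT,
    "xmp": RAWTEXT,
    "iframe": RAWTEXT,
    "noembed": RAWTEXT,
    "noframes": RAWTEXT,
    "script": SCRIPT_DATA,
    "plaintext": PLAINTEXT,
}


def next_tokenizer_state(tag_name, current_state, scripting_enabled=False):
    """Return the tokenizer state to use after emitting a start tag.

    Simplifying assumption: insertion modes and foreign content are ignored.
    """
    # noscript content is only treated as raw text when scripting is enabled.
    if tag_name == "noscript":
        return RAWTEXT if scripting_enabled else current_state
    # Any other tag leaves the tokenizer in whatever state it was in.
    return STATE_AFTER_START_TAG.get(tag_name, current_state)
```

A tokenizer-only implementation could call next_tokenizer_state right after 
emitting each start tag token. For tags outside the table the state is 
returned unchanged, which matches the "stay in the same state" reading 
discussed earlier in the thread.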

HTH,
-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Monday, 1 July 2013 23:04:05 UTC