- From: Thomas Broyer <t.broyer@gmail.com>
- Date: Wed, 20 Jun 2007 09:38:31 +0200
> When the tokenization state machine is defined, every state first
> "consumes" and then potentially "emits". Some of the states transfer to
> another state with an order to "re-consume the character in the next
> state". This means that what you do in the new state is dependent on
> what you did in the last state and that the "consume" is necessarily an
> inconsistent operation. A much better wording would be "look at the next
> character" and on state transition "consume and emit" or just "emit
> without consumption" making it clear when the input cursor moves.

I did the same in Twintsam with PeekChar/PeekChars and EatChar/EatChars methods:
http://twintsam.googlecode.com/svn/trunk/Twintsam/Html/HtmlReader.StreamHandling.cs
(beware, Twintsam hasn't been updated since January, so it's not in sync with the spec as it is now)

Actually, though, you could just use a character queue into which you push back characters that need to be "re-consumed" (i.e. you "un-read" the character and then switch to the other state). This is what html5lib does:
http://html5lib.googlecode.com/svn/trunk/python/src/tokenizer.py
(search for self.stream.queue; this needs to be refactored with an unread() method on the HTMLInputStream)

That is to say, I don't think the spec should be changed at all. It's just a matter of how you implement it. You just have to know that the "queue" won't ever be larger than 9 characters, as there are tweaks for 0-prefixed numeric entities and/or numeric entities greater than 1114111.

> It would be nice if all <!...> tags (except comments) were considered
> "declarations" instead of bogus comments. Then DOCTYPE wouldn't need
> special handling by the tokenizer, just special handling by the parser.
> (Too much of the parser seems to have gotten into the tokenizer; with
> CDATA and RCDATA, this is a necessary evil. With <!DOCTYPE ...> it
> isn't.)

I can't see the problem here; besides, DOCTYPE parsing is special because we need the DOCTYPE name. Moreover, the spec has changed recently so that DOCTYPE parsing takes care of PUBLIC and SYSTEM identifiers.

--
Thomas Broyer
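
For illustration, the push-back queue idea described above could look roughly like this in Python. This is only a sketch under assumptions: InputStream, unread() and the three toy states are made-up names for this example, not the actual code of html5lib or Twintsam, and the state machine is far simpler than the one in the spec. It only shows that "re-consume the character in the next state" can be implemented as "un-read, then switch state".

```python
from collections import deque

EOF = None  # sentinel returned at end of input (assumption for this sketch)


class InputStream:
    """Hypothetical character stream with a push-back queue."""

    def __init__(self, text):
        self._text = text
        self._pos = 0
        self._queue = deque()  # characters waiting to be re-consumed

    def char(self):
        """Consume and return the next character, or EOF when exhausted."""
        if self._queue:
            return self._queue.popleft()
        if self._pos >= len(self._text):
            return EOF
        c = self._text[self._pos]
        self._pos += 1
        return c

    def unread(self, c):
        """Push a character back so that the next state consumes it again."""
        if c is not EOF:
            self._queue.appendleft(c)


def tokenize(text):
    """Toy tokenizer: every state consumes first, then may emit or un-read."""
    stream = InputStream(text)
    tokens = []
    state = "data"
    name = ""
    while True:
        c = stream.char()  # every state starts by consuming one character
        if state == "data":
            if c is EOF:
                break
            if c == "<":
                state = "tag_open"
            else:
                tokens.append(("char", c))
        elif state == "tag_open":
            if c is not EOF and c.isalpha():
                name = c
                state = "tag_name"
            else:
                # Not a tag after all: emit "<" as text and
                # re-consume the current character in the data state.
                tokens.append(("char", "<"))
                stream.unread(c)
                state = "data"
        elif state == "tag_name":
            if c is EOF:
                break  # unfinished tag is simply dropped in this toy
            if c == ">":
                tokens.append(("tag", name))
                state = "data"
            else:
                name += c
    return tokens
```

With this, tokenize("a<1<b>") yields the characters "a", "<" and "1" followed by a tag token named "b": the "1" is read once in the tag-open state, pushed back, and read again in the data state, so the states themselves never need a separate "peek" operation.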
Received on Wednesday, 20 June 2007 00:38:31 UTC