[whatwg] html5 parsing/tokenizing from Benjamin West on 2007-06-19 (public-whatwg-archive@w3.org from June 2007)

From: Benjamin West <bewest@gmail.com>
Date: Tue, 19 Jun 2007 16:20:11 -0700
Message-ID: <8ad71be30706191620t43a6ab88v51037e1bf6c49f6@mail.gmail.com>

I have a friend who has implemented a fast tokenizer in C.  I asked
him to send me any feedback he might have, and so what follows are his
words.  This is from about a month ago, so I apologize if any of this
is old ground.

-Ben

-------------
When the tokenization state machine is defined, every state first
"consumes" and then potentially "emits". Some of the states transfer to
another state with an order to "re-consume the character in the next
state". This means that what you do in the new state is dependent on
what you did in the last state, and that "consume" is necessarily an
inconsistent operation. A much better wording would be "look at the next
character" and, on state transition, "consume and emit" or just "emit
without consumption", making it clear when the input cursor moves.
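
To illustrate the wording suggested above, here is a minimal sketch in C of a
tokenizer loop where every state only looks at the next character and the
cursor moves in exactly one place. All names (Cursor, peek, advance,
count_text_chars) and the three-state machine are hypothetical
simplifications for illustration, not the states defined in the draft.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical input cursor; not taken from the spec. */
typedef struct {
    const char *buf;
    size_t len;
    size_t pos;   /* index of the next unconsumed character */
} Cursor;

/* "Look at the next character": never moves the cursor. -1 means end of input. */
static int peek(const Cursor *c) {
    return c->pos < c->len ? (unsigned char)c->buf[c->pos] : -1;
}

/* "Consume": the only function that moves the cursor. */
static void advance(Cursor *c) {
    assert(c->pos < c->len);
    c->pos++;
}

typedef enum { DATA, TAG_OPEN, IN_TAG } State;

/* Counts the character tokens a very reduced tokenizer would emit. */
static size_t count_text_chars(const char *input) {
    Cursor c = { input, strlen(input), 0 };
    State state = DATA;
    size_t text = 0;

    for (;;) {
        int ch = peek(&c);              /* every state starts by looking */
        switch (state) {
        case DATA:
            if (ch < 0) return text;
            if (ch == '<') { advance(&c); state = TAG_OPEN; }
            else           { text++; advance(&c); }   /* consume and emit */
            break;
        case TAG_OPEN:
            if (ch >= 0 && (isalpha(ch) || ch == '/')) {
                state = IN_TAG;         /* transition without consumption */
            } else {
                text++;                 /* emit the literal '<' ...        */
                state = DATA;           /* ... without consuming anything  */
            }
            break;
        case IN_TAG:
            if (ch < 0) return text;
            advance(&c);                /* skip tag content up to '>' */
            if (ch == '>') state = DATA;
            break;
        }
    }
}
```

With this shape, the "re-consume the character in the next state" case is
simply a transition that does not call advance(), so the new state never has
to know what the previous state did.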



It would be nice if all <!...> tags (except comments) were considered
"declarations" instead of bogus comments. Then DOCTYPE wouldn't need
special handling by the tokenizer, just special handling by the parser.
(Too much of the parser seems to have gotten into the tokenizer; with
CDATA and RCDATA, this is a necessary evil. With <!DOCTYPE ...> it
isn't.)
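
A rough sketch in C of the split proposed above: the tokenizer emits one
generic "declaration" token for any <!...> that is not a comment, and only
the parser decides whether that declaration is a DOCTYPE. The token type,
the function names, and the handling of unterminated input (returning
TOK_NONE) are assumptions made for this example, not anything from the draft.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef enum { TOK_NONE, TOK_COMMENT, TOK_DECLARATION } TokenKind;

/* Hypothetical token: a kind plus a slice of the input. */
typedef struct {
    TokenKind kind;
    const char *body;  /* points into the input, after "<!" or "<!--" */
    size_t len;        /* length of the body, excluding the terminator */
} MarkupToken;

/* Tokenizer side: knows about comments and declarations, nothing else. */
static MarkupToken scan_markup_declaration(const char *s) {
    MarkupToken t = { TOK_NONE, NULL, 0 };
    assert(s != NULL);
    if (strncmp(s, "<!", 2) != 0) return t;

    const char *body = s + 2;
    if (strncmp(body, "--", 2) == 0) {
        const char *end = strstr(body + 2, "-->");
        if (end == NULL) return t;          /* unterminated: simplified away */
        t.kind = TOK_COMMENT;
        t.body = body + 2;
        t.len = (size_t)(end - (body + 2));
        return t;
    }

    const char *end = strchr(body, '>');
    if (end == NULL) return t;              /* unterminated: simplified away */
    t.kind = TOK_DECLARATION;
    t.body = body;
    t.len = (size_t)(end - body);
    return t;
}

/* Parser side: the only place that knows the word DOCTYPE. */
static int is_doctype(const MarkupToken *t) {
    static const char kw[] = "doctype";
    size_t n = sizeof kw - 1;
    if (t->kind != TOK_DECLARATION || t->len < n) return 0;
    for (size_t i = 0; i < n; i++) {
        /* ASCII case-insensitive comparison against the keyword */
        if (tolower((unsigned char)t->body[i]) != kw[i]) return 0;
    }
    return 1;
}
```

The tokenizer here needs no DOCTYPE-specific state at all; a parser would
call is_doctype() on each declaration token and treat anything else as it
sees fit.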



Other than that, the definition is pretty solid and I've come to terms
with the XML-interoperability issues I formerly expressed. I've added a
switch to my parser that tells it whether or not to honor RCDATA
sections, and I've resolved never to feed it CDATA. (I know it's not
supposed to be an XML parser.) ~D

Received on Tuesday, 19 June 2007 16:20:11 UTC