- From: Benjamin West <bewest@gmail.com>
- Date: Tue, 19 Jun 2007 16:20:11 -0700
I have a friend who has implemented a fast tokenizer in C. I asked him to send me any feedback he might have, and so what follows are his words. This is from about a month ago, so I apologize if any of this is old ground.

-Ben

-------------

When the tokenization state machine is defined, every state first "consumes" and then potentially "emits". Some of the states transfer to another state with an order to "re-consume the character in the next state". This means that what you do in the new state is dependent on what you did in the last state, and that "consume" is necessarily an inconsistent operation. A much better wording would be "look at the next character" and, on state transition, "consume and emit" or just "emit without consumption", making it clear when the input cursor moves.

It would be nice if all <!...> tags (except comments) were considered "declarations" instead of bogus comments. Then DOCTYPE wouldn't need special handling by the tokenizer, just special handling by the parser. (Too much of the parser seems to have gotten into the tokenizer; with CDATA and RCDATA, this is a necessary evil. With <!DOCTYPE ...> it isn't.)

Other than that, the definition is pretty solid, and I've come to terms with the XML-interoperability issues I formerly expressed. I've added a switch to my parser that tells it whether or not to honor RCDATA sections, and I've purposed never to feed it CDATA. (I know it's not supposed to be an XML parser.)

~D
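
To illustrate the "look at the next character" wording suggested above, here is a minimal sketch in Python. It is not the spec's tokenizer and not the C implementation mentioned in this message; the three states, the token tuples, and the function name `tokenize` are all hypothetical, chosen only to show a loop in which every state peeks and the cursor moves only on an explicit `pos += 1`.

```python
from enum import Enum, auto


class State(Enum):
    # Hypothetical toy states, far fewer than the real tokenizer defines.
    DATA = auto()
    TAG_OPEN = auto()
    TAG_NAME = auto()


def tokenize(text):
    """Toy tokenizer: each step only looks at text[pos]; the cursor
    moves only where `pos += 1` appears, so there is no "re-consume"."""
    tokens = []
    buf = []
    state = State.DATA
    pos = 0
    while pos < len(text):
        ch = text[pos]  # look at the next character (cursor unchanged)
        if state is State.DATA:
            if ch == "<":
                if buf:
                    tokens.append(("text", "".join(buf)))
                    buf = []
                pos += 1  # consume '<'
                state = State.TAG_OPEN
            else:
                buf.append(ch)
                pos += 1  # consume a text character
        elif state is State.TAG_OPEN:
            if ch.isalpha():
                # Transition without consumption: TAG_NAME looks at
                # this same character on the next iteration.
                state = State.TAG_NAME
            else:
                # Not a tag after all: keep '<' as literal text and
                # return to DATA, again without moving the cursor.
                buf.append("<")
                state = State.DATA
        elif state is State.TAG_NAME:
            if ch == ">":
                tokens.append(("tag", "".join(buf)))  # consume and emit
                buf = []
                pos += 1
                state = State.DATA
            else:
                buf.append(ch)
                pos += 1
    # End of input: flush whatever is pending as text.
    if state is State.TAG_OPEN:
        buf.append("<")
    elif state is State.TAG_NAME:
        buf.insert(0, "<")
    if buf:
        tokens.append(("text", "".join(buf)))
    return tokens
```

In this sketch the two `TAG_OPEN` branches are the cases the spec would phrase as "re-consume in the next state"; here they are simply transitions that leave `pos` alone, so the next state's behaviour does not depend on what the previous state did with the cursor.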
Received on Tuesday, 19 June 2007 16:20:11 UTC