- From: Simon Sapin <simon.sapin@kozea.fr>
- Date: Sun, 27 May 2012 20:21:11 +0200
- To: www-style@w3.org
On 26/05/2012 00:23, Tab Atkins Jr. wrote:
> So the question is simply, as someone implementing or maintaining a
> parser, which style is more useful to read?
As an implementer of WeasyPrint and tinycss, I very much prefer to
cleanly separate the various steps. Dividing a complex problem into
smaller problems makes each of them easier to think about.
This means that the tokenizer and parser only communicate through a
well-defined API, and that API is as small as possible. In this case,
the tokenizer turns a flat sequence of Unicode codepoints into a flat
sequence of tokens. The parser turns these tokens into some higher-level
data structure. The tokenizer does not know anything about the parser.
(Turning bytes into codepoints is yet another step, which I keep
separate from the tokenizer.)
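
A minimal sketch of this separation in Python, assuming a deliberately
tiny token grammar rather than the real CSS one. The names `Token`,
`decode`, `tokenize` and `parse_declaration` are hypothetical and are
not the tinycss API. Each step only sees the output of the previous one:

```python
import re
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    type: str
    value: str


# Illustrative token grammar only; the real CSS tokenizer has many more rules.
_TOKEN_RE = re.compile(
    r"""
    (?P<S>\s+)
  | (?P<IDENT>-?[A-Za-z_][A-Za-z0-9_-]*)
  | (?P<NUMBER>[0-9]+)
  | (?P<DELIM>.)
    """,
    re.VERBOSE | re.DOTALL,
)


def decode(data: bytes) -> str:
    """Step 1: bytes to Unicode codepoints (UTF-8 assumed for the sketch)."""
    return data.decode("utf-8")


def tokenize(text: str) -> List[Token]:
    """Step 2: flat sequence of codepoints to flat sequence of tokens."""
    return [Token(m.lastgroup, m.group()) for m in _TOKEN_RE.finditer(text)]


def parse_declaration(tokens: List[Token]) -> Tuple[str, List[str]]:
    """Step 3: tokens to a higher-level structure, here (name, values)."""
    significant = [t for t in tokens if t.type != "S"]
    if len(significant) < 2:
        raise ValueError("expected 'name: value'")
    name, colon, *values = significant
    if name.type != "IDENT" or colon != Token("DELIM", ":"):
        raise ValueError("expected 'name: value'")
    return name.value, [t.value for t in values]


print(parse_declaration(tokenize(decode(b"margin: 0 auto"))))
```

The tokenizer never calls into the parser, and the parser only looks at
`Token` values, so the whole interface between them is that one type.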
This does *not* mean that the tokenizer has to finish, with all the
tokens in memory, before the parser can start. There can be some kind of
iterator where tokens are generated on demand. But this is only an
implementation detail.
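
One way to generate tokens on demand, sketched as a Python generator.
The token grammar is again a toy, and `iter_tokens` and `first_ident`
are hypothetical names. The consumer pulls tokens one at a time, and
whatever it does not ask for is never produced:

```python
import re
from typing import Iterable, Iterator, Optional, Tuple

# Toy grammar for illustration, not the real CSS tokenizer.
_TOKEN_RE = re.compile(
    r"(?P<S>\s+)|(?P<IDENT>[A-Za-z_-][A-Za-z0-9_-]*)|(?P<DELIM>.)",
    re.DOTALL,
)


def iter_tokens(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (type, value) pairs lazily, one match at a time."""
    for match in _TOKEN_RE.finditer(text):
        yield match.lastgroup, match.group()


def first_ident(tokens: Iterable[Tuple[str, str]]) -> Optional[str]:
    """A stand-in for a parser: stops pulling tokens as soon as it can."""
    for type_, value in tokens:
        if type_ == "IDENT":
            return value
    return None


tokens = iter_tokens("a b c")
print(first_ident(tokens))  # consumes only the first token
print(next(tokens))         # the rest of the stream is still pending
```

The parser code is the same whether it receives a list or a generator,
which is why the choice stays an implementation detail.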
This leaves the problem of :nth-*(). I can’t find the reference, but I
remember reading a suggestion on this list: the tokens between '(' and
')' could be serialized back to a Unicode string, and tokenized again
by a different tokenizer.
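
A rough sketch of that suggestion, assuming tokens are (type, value)
pairs that keep their source text, and assuming a simplified an+b
grammar that does not cover every case in the Selectors specification.
The token type names, `serialize` and `parse_nth` are hypothetical:

```python
import re
from typing import Iterable, Tuple

# Simplified an+b grammar, applied to the re-serialized string.
_NTH_RE = re.compile(
    r"""
    \s*
    (?: (?P<keyword>odd|even)
      | (?P<a>[+-]?\d*)n \s* (?: (?P<sign>[+-]) \s* (?P<b>\d+) )?
      | (?P<only_b>[+-]?\d+)
    )
    \s*$
    """,
    re.VERBOSE | re.IGNORECASE,
)


def serialize(tokens: Iterable[Tuple[str, str]]) -> str:
    """Turn the tokens found between '(' and ')' back into a string."""
    return "".join(value for _type, value in tokens)


def parse_nth(tokens: Iterable[Tuple[str, str]]) -> Tuple[int, int]:
    """Re-read the serialized text with the dedicated grammar; return (a, b)."""
    text = serialize(tokens)
    match = _NTH_RE.match(text)
    if match is None:
        raise ValueError("invalid :nth-*() argument: %r" % text)
    keyword = match.group("keyword")
    if keyword:
        return (2, 1) if keyword.lower() == "odd" else (2, 0)
    if match.group("only_b") is not None:
        return 0, int(match.group("only_b"))
    a_text = match.group("a")
    # A bare 'n', '+n' or '-n' means a coefficient of 1 or -1.
    a = int(a_text + "1") if a_text in ("", "+", "-") else int(a_text)
    b = int(match.group("sign") + match.group("b")) if match.group("b") else 0
    return a, b


# However the generic tokenizer happened to split the argument,
# only the concatenated text matters here.
print(parse_nth([("DIMENSION", "2n"), ("DELIM", "+"), ("NUMBER", "1")]))
```

The generic tokenizer stays unaware of :nth-*(); the cost is a second
pass over a very short string.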
--
Simon Sapin
Received on Sunday, 27 May 2012 18:21:40 UTC