Re: [css3-syntax] Preference for parser speccing? from Simon Sapin on 2012-05-27 (www-style@w3.org from May 2012)

From: Simon Sapin <simon.sapin@kozea.fr>
Date: Sun, 27 May 2012 20:21:11 +0200
To: www-style@w3.org
Message-ID: <4FC27097.8060308@kozea.fr>

On 26/05/2012 00:23, Tab Atkins Jr. wrote:
> So the question is simply, as someone implementing or maintaining a
> parser, which style is more useful to read?

As an implementer of WeasyPrint and tinycss, I very much prefer to 
cleanly separate the various steps. Dividing a complex problem into 
smaller problems makes each one easier to reason about.

This means that the tokenizer and parser only communicate through a 
well-defined API, and that API is as small as possible. In this case, 
the tokenizer turns a flat sequence of Unicode codepoints into a flat 
sequence of tokens. The parser turns these tokens into some higher-level 
data structure. The tokenizer does not know anything about the parser. 
(Turning bytes into codepoints is yet another step, which I keep 
separate from the tokenizer.)
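
To make the separation concrete, here is a minimal sketch in Python. 
It is not the tinycss code: the token grammar is a toy one, and the 
names decode, tokenize, parse_declarations and Token are made up for 
this illustration. The only thing that crosses each boundary is a flat 
sequence (bytes, then codepoints, then tokens).

```python
import re
from typing import Iterable, Iterator, List, NamedTuple, Tuple


class Token(NamedTuple):
    type: str
    value: str


# Toy token grammar (an assumption for this sketch, far smaller than CSS).
_TOKEN_RE = re.compile(r"""
    (?P<S>\s+)
  | (?P<IDENT>-?[A-Za-z_][A-Za-z0-9_-]*)
  | (?P<NUMBER>[0-9]+)
  | (?P<DELIM>.)
""", re.VERBOSE | re.DOTALL)


def decode(data: bytes) -> str:
    """Step 1: bytes to codepoints. Assumes UTF-8 for simplicity."""
    return data.decode("utf-8")


def tokenize(text: str) -> Iterator[Token]:
    """Step 2: codepoints to a flat sequence of tokens.

    Knows nothing about declarations, rules or selectors.
    """
    for match in _TOKEN_RE.finditer(text):
        yield Token(match.lastgroup, match.group())


def _finish(tokens: List[Token], out: List[Tuple[str, List[str]]]) -> None:
    # Accept only: IDENT ':' value+ ; silently drop anything else.
    if (len(tokens) >= 3 and tokens[0].type == "IDENT"
            and tokens[1] == Token("DELIM", ":")):
        out.append((tokens[0].value, [t.value for t in tokens[2:]]))


def parse_declarations(tokens: Iterable[Token]) -> List[Tuple[str, List[str]]]:
    """Step 3: tokens to a higher-level structure.

    Only looks at tokens, never at the source text.
    """
    declarations: List[Tuple[str, List[str]]] = []
    current: List[Token] = []
    for token in tokens:
        if token.type == "S":
            continue
        if token == Token("DELIM", ";"):
            _finish(current, declarations)
            current = []
        else:
            current.append(token)
    _finish(current, declarations)
    return declarations


result = parse_declarations(tokenize(decode(b"color: red; margin: 0")))
```

Each step can be tested on its own, and any of them can be replaced 
without touching the others as long as the sequence type in between 
stays the same.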

This does *not* mean that the tokenizer has to be finished and all the 
tokens in memory before the parser can start. There can be some kind of 
iterator where tokens are generated on demand. But this is only an 
implementation detail.

This leaves the problem of :nth-*(). I can’t find the reference, but I 
remember reading a suggestion on this list: the tokens between '(' and 
')' could be serialized back to a Unicode string, and tokenized again 
by a different tokenizer.
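
A sketch of how that suggestion could look, under two assumptions that 
are mine and not from the original suggestion: every token keeps its 
exact source text (so concatenating the values gives back the original 
string), and the second "tokenizer" is a single regular expression for 
a simplified an+b syntax, more permissive about whitespace than the 
real grammar. The token types and function names are illustrative 
only.

```python
import re
from typing import Iterable, List, Tuple

Token = Tuple[str, str]  # (type, source text)

# Simplified an+b syntax (assumption: not the exact Selectors grammar).
_NTH_RE = re.compile(r"""
    \s*(?:
        (?P<keyword>odd|even)
      | (?P<a>[+-]?[0-9]*)n\s*(?:(?P<sign>[+-])\s*(?P<b>[0-9]+))?
      | (?P<integer>[+-]?[0-9]+)
    )\s*
""", re.VERBOSE | re.IGNORECASE)


def function_arguments(tokens: Iterable[Token]) -> List[Token]:
    """Return the tokens between the first FUNCTION token and the next ')'."""
    inner: List[Token] = []
    inside = False
    for type_, value in tokens:
        if not inside:
            inside = type_ == "FUNCTION"
        elif value == ")":
            break
        else:
            inner.append((type_, value))
    return inner


def serialize(tokens: Iterable[Token]) -> str:
    """Turn tokens back into a string by joining their source text."""
    return "".join(value for _type, value in tokens)


def parse_nth(text: str) -> Tuple[int, int]:
    """Second pass over the serialized string; returns (a, b) for an+b."""
    match = _NTH_RE.fullmatch(text)
    if match is None:
        raise ValueError("not an an+b expression: %r" % text)
    if match.group("keyword"):
        return (2, 1) if match.group("keyword").lower() == "odd" else (2, 0)
    if match.group("integer") is not None:
        return (0, int(match.group("integer")))
    a_text = match.group("a")
    # A bare "n", "+n" or "-n" means a coefficient of 1 or -1.
    a = int(a_text + "1") if a_text in ("", "+", "-") else int(a_text)
    b = int(match.group("sign") + match.group("b")) if match.group("b") else 0
    return (a, b)


# What a generic tokenizer might produce for ":nth-child(2n+1)".
selector_tokens = [
    ("DELIM", ":"), ("FUNCTION", "nth-child("),
    ("DIMENSION", "2n"), ("DELIM", "+"), ("NUMBER", "1"),
    (")", ")"),
]
a_b = parse_nth(serialize(function_arguments(selector_tokens)))
```

The main tokenizer stays unaware of an+b. The cost is that tokens have 
to carry enough information to be serialized back faithfully.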

-- 
Simon Sapin

Received on Sunday, 27 May 2012 18:21:40 UTC