- From: Tab Atkins Jr. <jackalmage@gmail.com>
- Date: Mon, 28 May 2012 08:08:47 -0700
- To: Simon Sapin <simon.sapin@kozea.fr>
- Cc: www-style@w3.org
On Sun, May 27, 2012 at 11:21 AM, Simon Sapin <simon.sapin@kozea.fr> wrote:
> On 26/05/2012 00:23, Tab Atkins Jr. wrote:
>> So the question is simply, as someone implementing or maintaining a
>> parser, which style is more useful to read?
>
> As an implementer for WeasyPrint and tinycss, I very much prefer to cleanly
> separate the various steps. Dividing a complex problem into smaller problems
> makes it easier to think about.
>
> This means that the tokenizer and parser only communicate through a
> well-defined API, and that API is as small as possible. In this case, the
> tokenizer turns a flat sequence of Unicode codepoints into a flat sequence
> of tokens. The parser turns these tokens into some higher-level data
> structure. The tokenizer does not know anything about the parser. (Turning
> bytes into codepoints is yet another step, which I separate from the
> tokenizer.)
>
> This does *not* mean that the tokenizer has to be finished and all the
> tokens in memory before the parser can start. There can be some kind of
> iterator where tokens are generated on demand. But this is only an
> implementation detail.

Okay.

> This leaves the problem of :nth-*(). I can’t find the reference, but I
> remember reading a suggestion on this list: the tokens between '(' and ')'
> could be serialized back to a Unicode string, and tokenized again by a
> different tokenizer.

Yes, that's one way to do it. You can reconstruct the text of the an+b
from the tokens adequately enough to do this. The only details you'll
lose are comments and exactly what sort of whitespace is used.

~TJ
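
For illustration, here is a minimal Python sketch of the approach discussed above: a tokenizer that hands out tokens on demand, plus re-serializing the tokens between '(' and ')' so that a separate an+b parser can read them as text. Everything in it is an assumption made for the example: the token grammar is a toy one, far simpler than the real CSS tokenizer, and the names `Token`, `tokenize`, `serialize`, `nth_argument` and `parse_anb` are hypothetical, not taken from tinycss or any spec.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Tuple


class Token(NamedTuple):
    type: str
    text: str


# Toy token grammar (an assumption; much simpler than real CSS tokenization).
_TOKEN_RE = re.compile(
    r"""
    (?P<COMMENT>/\*.*?\*/)
  | (?P<S>\s+)
  | (?P<DIMENSION>[0-9]+[A-Za-z_][A-Za-z0-9_-]*)
  | (?P<NUMBER>[0-9]+)
  | (?P<IDENT>-?[A-Za-z_][A-Za-z0-9_-]*)
  | (?P<DELIM>.)
    """,
    re.VERBOSE | re.DOTALL,
)

# Text-level grammar for an+b, applied to the re-serialized argument.
_ANB_RE = re.compile(
    r"(?:(?P<a>[+-]?[0-9]*)n(?:\s*(?P<sign>[+-])\s*(?P<b>[0-9]+))?"
    r"|(?P<only_b>[+-]?[0-9]+))$"
)


def tokenize(css: str) -> Iterator[Token]:
    """Yield tokens lazily; the consumer pulls them one at a time."""
    for match in _TOKEN_RE.finditer(css):
        yield Token(match.lastgroup, match.group())


def serialize(tokens: Iterable[Token]) -> str:
    """Turn tokens back into text, dropping comments and collapsing
    each whitespace run to one space (the details that get lost)."""
    parts = []
    for tok in tokens:
        if tok.type == "COMMENT":
            continue
        parts.append(" " if tok.type == "S" else tok.text)
    return "".join(parts)


def nth_argument(tokens: Iterator[Token]) -> str:
    """Consume tokens up to the next ')' and return them as text."""
    inner = []
    for tok in tokens:
        if tok.type == "DELIM" and tok.text == ")":
            break
        inner.append(tok)
    return serialize(inner)


def parse_anb(text: str) -> Tuple[int, int]:
    """Parse 'odd', 'even', 'an+b', 'an' or 'b' into the pair (a, b)."""
    text = text.strip().lower()
    if text == "odd":
        return (2, 1)
    if text == "even":
        return (2, 0)
    match = _ANB_RE.match(text)
    if match is None:
        raise ValueError("invalid an+b expression: %r" % text)
    if match.group("only_b") is not None:
        return (0, int(match.group("only_b")))
    a_text = match.group("a")
    # A bare 'n', '+n' or '-n' means a coefficient of 1 or -1.
    a = int(a_text + "1") if a_text in ("", "+", "-") else int(a_text)
    b = 0
    if match.group("b") is not None:
        b = int(match.group("sign") + match.group("b"))
    return (a, b)


# Usage: skip ahead to '(', then hand the argument to the an+b parser.
stream = tokenize("li:nth-child( 2n /* odd rows */ + 1 )")
for tok in stream:
    if tok.type == "DELIM" and tok.text == "(":
        break
result = parse_anb(nth_argument(stream))
```

In this sketch the main tokenizer knows nothing about an+b; the only thing crossing the boundary is the re-serialized string, which is why the comment and the exact whitespace do not survive.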
Received on Monday, 28 May 2012 15:09:39 UTC