Re: [css3-syntax] Preference for parser speccing? from Tab Atkins Jr. on 2012-05-28 (www-style@w3.org from May 2012)

From: Tab Atkins Jr. <jackalmage@gmail.com>
Date: Mon, 28 May 2012 08:08:47 -0700
To: Simon Sapin <simon.sapin@kozea.fr>
Cc: www-style@w3.org
Message-ID: <CAAWBYDCRYbZH1i71aeLQAkVS+F9e7NzcOBNSP8QYBm=f49wReg@mail.gmail.com>

On Sun, May 27, 2012 at 11:21 AM, Simon Sapin <simon.sapin@kozea.fr> wrote:
> On 26/05/2012 00:23, Tab Atkins Jr. wrote:
>> So the question is simply, as someone implementing or maintaining a
>> parser, which style is more useful to read?
>
> As an implementer for WeasyPrint and tinycss, I very much prefer to separate
> the various steps cleanly. Dividing a complex problem into smaller problems
> makes it easier to think about.
>
> This means that the tokenizer and parser only communicate through a
> well-defined API, and that API is as small as possible. In this case, the
> tokenizer turns a flat sequence of Unicode codepoints into a flat sequence
> of tokens. The parser turns these tokens into some higher-level data
> structure. The tokenizer does not know anything about the parser. (Turning
> bytes into codepoints is yet another step, which I separate from the
> tokenizer.)
>
> This does *not* mean that the tokenizer has to be finished and all the
> tokens in memory before the parser can start. There can be some kind of
> iterator where tokens are generated on demand. But this is only an
> implementation detail.

Okay.
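
To make the separation described above concrete, here is a minimal
sketch in Python. It is a toy, not CSS tokenization: the token shape,
the four token types, and the names tokenize() and parse_declarations()
are all assumptions made up for illustration. The point is only that the
tokenizer is a generator that knows nothing about the parser, and the
parser pulls tokens from it on demand.

```python
import re
from typing import Iterable, Iterator, List, Tuple

# Hypothetical minimal token shape: (type, text).
Token = Tuple[str, str]

# Toy token grammar, far simpler than real CSS tokenization.
_TOKEN_RE = re.compile(r"""
    (?P<S>\s+)
  | (?P<IDENT>[A-Za-z_-][A-Za-z0-9_-]*)
  | (?P<NUMBER>[0-9]+)
  | (?P<DELIM>.)
""", re.VERBOSE | re.DOTALL)


def tokenize(text: str) -> Iterator[Token]:
    """Turn a flat string into a flat stream of tokens, lazily."""
    for match in _TOKEN_RE.finditer(text):
        yield match.lastgroup, match.group()


def parse_declarations(tokens: Iterable[Token]) -> List[Tuple[str, List[str]]]:
    """Group tokens into (name, values) pairs split on ';'.

    Assumes well-formed 'name: value ...' input; no error recovery.
    """
    declarations = []
    current: List[str] = []
    for type_, text in tokens:
        if type_ == "S":
            continue  # whitespace is not significant in this toy grammar
        if (type_, text) == ("DELIM", ";"):
            if current:
                declarations.append((current[0], current[2:]))
            current = []
        else:
            current.append(text)
    if current:
        declarations.append((current[0], current[2:]))
    return declarations
```

Calling parse_declarations(tokenize("color: red; margin: 0 auto")) never
builds the full token list in memory; the parser only ever sees the
iterator, which is the whole API between the two steps.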

> This leaves the problem of :nth-*(). I can’t find the reference, but I
> remember reading a suggestion on this list: the tokens between '(' and ')'
> could be serialized back to a Unicode string, and tokenized again by a
> different tokenizer.

Yes, that's one way to do it.  You can reconstruct the text of the
an+b from the tokens adequately enough to do this.  The only details
you'll lose are comments and exactly what sort of whitespace is used.
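
A rough sketch of that round trip, in Python. Everything here is an
assumption for illustration: the (type, text) token shape, the token
type names (loosely modelled on CSS 2.1), and the helper names
serialize() and parse_an_plus_b(). The regular expression is a
simplified an+b grammar, not the one from the Selectors spec.

```python
import re
from typing import Iterable, Optional, Tuple

# Hypothetical token shape: (type, original text).
Token = Tuple[str, str]

# Simplified an+b grammar: 'odd', 'even', 'an', 'an+b', 'an-b', or 'b'.
_AN_PLUS_B_RE = re.compile(r"""
    ^\s*
    (?:
        (?P<keyword>odd|even)
      | (?P<a>[+-]?\d*)n(?:\s*(?P<sign>[+-])\s*(?P<b>\d+))?
      | (?P<only_b>[+-]?\d+)
    )
    \s*$
""", re.VERBOSE | re.IGNORECASE)


def serialize(tokens: Iterable[Token]) -> str:
    """Rebuild source text from tokens; comments are already gone."""
    return "".join(text for _type, text in tokens)


def parse_an_plus_b(tokens: Iterable[Token]) -> Optional[Tuple[int, int]]:
    """Return (a, b) for the tokens between '(' and ')', or None if invalid."""
    match = _AN_PLUS_B_RE.match(serialize(tokens))
    if match is None:
        return None
    keyword = match.group("keyword")
    if keyword:
        return (2, 1) if keyword.lower() == "odd" else (2, 0)
    if match.group("only_b") is not None:
        return (0, int(match.group("only_b")))
    a_text = match.group("a")
    # A bare 'n', '+n' or '-n' means a coefficient of 1 or -1.
    a = int(a_text + "1") if a_text in ("", "+", "-") else int(a_text)
    b = int(match.group("sign") + match.group("b")) if match.group("b") else 0
    return (a, b)
```

With this, the generic tokenizer stays ignorant of an+b, and the
second, specialised pass only ever sees a short string.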

~TJ

Received on Monday, 28 May 2012 15:09:39 UTC