[css3-syntax] Tokenizing expressions. {Was: Re: [css3-values] calc(), whitespace, and DIMENSION tokens}

Bert Bos wrote:
> On Wednesday 12 March 2008 13:41, Andrei Polushin wrote:
>> 2008/3/12, fantasai wrote:
>>>
>>>  I'll note that
>>>    3cm-2cm
>>>  will be parsed as a single DIMENSION token
>>>
>> I propose changing the grammar around the {ident} as follows:
>>   [...]
>> That is, the unit name cannot contain '-', unless that unit name
>> starts with either '-' or '_', as described by [1]
>
> That will indeed make "3n-1" parse as DIMEN(3n) + DELIM(-) + NUMBER(1), 
> but parsers still have to check that the DIMEN is a whole number with 
> an "n" or "N"; and "n-1" remains an IDENT. (Of course, "n-1" won't 
> occur very often in practice. :-) )

Yes, I also realized that while reading the implementation notes at
https://bugzilla.mozilla.org/show_bug.cgi?id=75375#c35


> So I think there is no benefit for nth-child.

Yes and no. There is no direct benefit, but CSS expression syntax
may still evolve that way.

It looks like WebKit goes further; see [the excerpt of its grammar][2]:

     nth             (-?[0-9]*n[\+-][0-9]+)|(-?[0-9]*n)

That means that an IDENT of the form (n|n-1|-n|-n-1) is not an IDENT,
and a DIMENSION of the form (3n|3n-1|-3n|-3n-1) is not a DIMENSION.
Each is an NTH token there. It looks like a hack, and it may require
attention in other parts of the grammar (esp. class selectors like .-n),
but it works.
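
To make the effect of that pattern concrete, here is a minimal Python
sketch. It only transcribes the quoted flex pattern into Python's "re"
syntax; the names NTH and is_nth are mine, not WebKit's, and the sketch
matches a lowercase "n" only, which may differ from the options the real
tokenizer is built with.

```python
import re

# Transcription of the quoted pattern:
#   (-?[0-9]*n[\+-][0-9]+)|(-?[0-9]*n)
NTH = re.compile(r"(-?[0-9]*n[+-][0-9]+)|(-?[0-9]*n)")


def is_nth(text: str) -> bool:
    """Return True if the whole string matches the NTH pattern."""
    return NTH.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["n", "n-1", "-n", "-n-1", "3n", "3n-1", "-3n", "-3n-1", "3cm"]:
        print(sample, is_nth(sample))
```

Every form listed above is accepted, while an ordinary dimension such
as "3cm" is not.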


> It also doesn't remove the need for a space after 'mod' in 'calc(10%
> mod-2em)', although it avoids many other possible user errors:
> calc(10em-2px).
>
> But it's a change to the core grammar, a very dangerous thing to do:

That's right. I agree, and now suggest making those changes *locally*,
to the expression grammar only.

The complete proposal is as follows:

  1. The BASIC TOKENIZER remains the same, as defined by CSS21.

  2. In CSS3, certain parts of the grammar may be locally parsed as
     expressions. Those parts must remain parseable by a
     CSS21-conformant parser. A CSS21 parser should be able to skip
     over expressions, treating them as the "any" production per the
     CSS core syntax.

  3. To parse expressions, a CSS3 expression-aware parser should behave
     as if it creates an EXPR-TOKENIZER on top of its BASIC TOKENIZER.
     The CSS3 parser uses that EXPR-TOKENIZER to pull expression tokens.

  4. The EXPR-TOKENIZER consumes tokens provided by the BASIC TOKENIZER,
     splits them according to certain rules, and yields them to a CSS3
     expression-aware parser.

     The splitting rules are:

     4.1. An {ident} may not contain a substring in which a MINUS SIGN
          is followed by a DECIMAL DIGIT. Such an {ident} is split
          around that MINUS SIGN.

          Example:
             IDENT(abc-1em)  => IDENT(abc) '-' DIMENSION(1em)

     4.2. An {ident} may not contain a trailing MINUS SIGN. Such an
          {ident} is split just before that trailing MINUS SIGN.

          Example:
             IDENT(abc-)     => IDENT(abc) '-'

     4.3. An {ident} may not start with a MINUS SIGN, unless that
          {ident} also contains a non-trailing MINUS SIGN, as described
          by [1]. Otherwise, the {ident} is split just after that
          starting MINUS SIGN.

          Example:
             IDENT(-abc)     =>  '-' IDENT(abc)
             IDENT(-abc-)    =>  '-' IDENT(abc) '-'
             IDENT(-abc-def) =>  IDENT(-abc-def)

     4.4. The splitting rules above apply equally to any {ident} that
          is part of an IDENT, DIMENSION, or FUNCTION token.

     4.5. In addition, the {ident} that designates the measurement unit
          of a DIMENSION token may not contain a MINUS SIGN, unless
          that {ident} itself starts with a MINUS SIGN, as described
          by [1]. Otherwise, it is split around that MINUS SIGN.

          Example:
             DIMENSION(3cm-2cm) => DIMENSION(3cm) '-' DIMENSION(2cm)
             DIMENSION(3-x-parsec) => DIMENSION(3-x-parsec)
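
As an illustration of rules 4.1 to 4.3, here is a minimal Python sketch
that splits the text of one {ident}. The function name split_ident is
hypothetical, the sketch is ASCII-only, and it returns plain strings
without classifying them: a piece that starts with a digit (such as
"1em") would still have to be re-read as a NUMBER or DIMENSION, and
rule 4.5 is not covered.

```python
import re


def split_ident(ident: str) -> list[str]:
    """Split one {ident} string per rules 4.1-4.3 (illustrative sketch)."""
    tail: list[str] = []

    # Rule 4.1: split around the first MINUS SIGN followed by a digit.
    # What follows starts with a digit, so it is no longer an {ident}.
    match = re.search(r"-(?=[0-9])", ident)
    if match:
        tail = ["-", ident[match.end():]]
        ident = ident[:match.start()]

    # Rule 4.2: split before a trailing MINUS SIGN (repeated if several).
    while len(ident) > 1 and ident.endswith("-"):
        tail = ["-"] + tail
        ident = ident[:-1]

    # Rule 4.3: split after a leading MINUS SIGN, unless another
    # MINUS SIGN remains later in the ident (the vendor-prefix form).
    head: list[str] = []
    if len(ident) > 1 and ident.startswith("-") and "-" not in ident[1:]:
        head = ["-"]
        ident = ident[1:]

    return head + ([ident] if ident else []) + tail


if __name__ == "__main__":
    for sample in ["abc-1em", "abc-", "-abc", "-abc-", "-abc-def"]:
        print(sample, "=>", split_ident(sample))
```

The five samples are the ones used in the examples for rules 4.1 to 4.3
above.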


The formal grammar of the EXPR-TOKENIZER is as follows (it cannot
be used directly, though):

   %

   alpha     [_a-z]|{nonascii}|{escape}
   alnum     [_a-z0-9]|{nonascii}|{escape}

   word      {alpha}{alnum}*
   phrase    {word}([-]{word})*

   prefixed  [-]{word}[-]{phrase}
   ident     {phrase}|{prefixed}
   unit      {word}|{prefixed}

   %

   {num}{unit}   {return DIMENSION;}
   {ident}       {return IDENT;}
   {ident}"("    {return FUNCTION;}

   %
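
For experimenting with the grammar, here is a rough Python
transcription. It rests on several assumptions of mine: {nonascii} and
{escape} are left out, {num} is taken from the CSS21 grammar, the
DIMENSION rule uses the {unit} definition as rule 4.5 describes,
matching is case-insensitive as in CSS21, and the NUMBER and DELIM
cases and the name tokenize are added only so that a whole string can
be consumed.

```python
import re

# Definitions section (ASCII-only simplification).
ALPHA = r"[_a-z]"
ALNUM = r"[_a-z0-9]"
WORD = rf"{ALPHA}{ALNUM}*"
PHRASE = rf"{WORD}(?:-{WORD})*"
PREFIXED = rf"-{WORD}-{PHRASE}"
IDENT = rf"(?:{PREFIXED}|{PHRASE})"
UNIT = rf"(?:{PREFIXED}|{WORD})"
NUM = r"(?:[0-9]*\.[0-9]+|[0-9]+)"  # {num} as defined in CSS21

# Rules section. Alternatives are ordered so that Python's first-match
# behaviour picks the same token a longest-match scanner would.
TOKEN = re.compile(
    rf"(?P<DIMENSION>{NUM}{UNIT})"
    rf"|(?P<NUMBER>{NUM})"
    rf"|(?P<FUNCTION>{IDENT}\()"
    rf"|(?P<IDENT>{IDENT})"
    rf"|(?P<S>\s+)"
    rf"|(?P<DELIM>.)",
    re.IGNORECASE,
)


def tokenize(text: str) -> list[tuple[str, str]]:
    """Return (token name, text) pairs, skipping whitespace."""
    return [
        (m.lastgroup, m.group())
        for m in TOKEN.finditer(text)
        if m.lastgroup != "S"
    ]


if __name__ == "__main__":
    for sample in ["3cm-2cm", "3-x-parsec", "abc-1em", "-abc-def", "3n-1"]:
        print(sample, "=>", tokenize(sample))
```

With these assumptions, "3cm-2cm" comes out as DIMENSION(3cm) '-'
DIMENSION(2cm), and "3-x-parsec" stays a single DIMENSION.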

I expect this to cover both the calc() and nth-child() syntax issues,
while remaining fully backward compatible with the CSS21 core syntax.

[1]: http://www.w3.org/TR/2007/CR-CSS21-20070719/syndata.html#vendor-keywords
[2]: http://trac.webkit.org/projects/webkit/browser/trunk/WebCore/css/tokenizer.flex?rev=30069#L26

-- 
Andrei Polushin

Received on Tuesday, 18 March 2008 03:01:04 UTC