[css3-syntax] universal version of the CSS2.1 error handling grammar (aka. core grammar)

While I was trying to understand whether using <value> for CSS Variables
makes sense or not, I realized that the CSS2.1 core grammar can be
extended to a universal one, where universality of a grammar[1] means
that the grammar is capable of generating all possible sequences of tokens.


Here is what I've got so far:

(1) stylesheet   : S* [ ignored S* | statement ]* selector? EOF;
(2) ignored      : [ CDO | CDC ];
(3) statement    : ruleset | at-rule;

/* The following three structures are ignored as a whole if
   unrecognized */
(4) at-rule      : ATKEYWORD S* [ any | ']' S* | ')' S* | '}' S*
                                | ATKEYWORD S* | CDO S* | CDC S* ]*
                   [ block | ';' S* | EOF ];
(5) ruleset      : selector? '{' S* declaration?
                   [ ';' S* declaration? ]* [ '}' S* | EOF ];
(6) declaration  : [ no-close | ']' S* | ')' S* ]+;


(7) selector     : [ any | ';' S* | ']' S* | ')' S* | '}' S* ]
                   [ any | ';' S* | ']' S* | ')' S* | '}' S*
                   | ATKEYWORD S* | CDO S* | CDC S* ]*;
(8) any          : [ IDENT | NUMBER | PERCENTAGE | DIMENSION | STRING
                   | DELIM | URI | HASH | UNICODE-RANGE | INCLUDES
                   | DASHMATCH | ':' | BAD_STRING S
                   | [ FUNCTION | '(' | BAD_URI ]
                         S* [ no-close | ']' S* | ';' S* | '}' S* ]* ')'
                   | '[' S* [ no-close | ')' S* | ';' S* | '}' S* ]* ']'
                   ] S*
                   | [ FUNCTION | '(' | BAD_URI ]
                         S* [ no-close | ']' S* | ';' S* | '}' S* ]* EOF
                   | '[' S* [ no-close | ')' S* | ';' S* | '}' S* ]* EOF
                   | [ BAD_STRING | BAD_COMMENT ] EOF ;
(9)  block       :   '{' S* [ no-close | ')' S* | ';' S* | ']' S* ]*
                   [ '}' S* | EOF ] ;
(10) no-close    : any | block | ATKEYWORD S* | CDO S* | CDC S*;


What this grammar encodes is:

  1. The error handling rule for malformed structure. This grammar
     segments a stylesheet into a list of statements. It also splits
     each ruleset into a list of declarations. After this is done, a UA
     that is aware of the semantics of a stylesheet can process those
     it recognizes and ignore those that it doesn't recognize.
  2. The end-of-file handling rule.
  3. The requirement that CDO and CDC are only ignored at the top level,
     not inside a selector or the part after the ATKEYWORD.
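
To make points 1 and 3 concrete, here is a minimal Python sketch of the
segmentation (this is not the attached Bison code). It assumes the input
is already tokenized into plain strings with whitespace dropped, treats
any token ending in "(" as FUNCTION or '(', and leaves out BAD_STRING,
BAD_URI and BAD_COMMENT. The names segment and closer_for are made up
for illustration.

```python
CLOSER = {"(": ")", "[": "]", "{": "}"}

def closer_for(tok):
    """Return the closing token that `tok` opens, or None."""
    if tok in CLOSER:
        return CLOSER[tok]
    if tok.endswith("("):          # assumed FUNCTION token, e.g. "rgb("
        return ")"
    return None

def segment(tokens):
    """Split a token list into (kind, tokens) statements."""
    out = []
    i, n = 0, len(tokens)
    while i < n:
        # CDO/CDC are ignored only between statements (rules (1), (2)).
        if tokens[i] in ("<!--", "-->"):
            i += 1
            continue
        start = i
        # Assumption: an ATKEYWORD is "@" followed by at least one character.
        is_at = tokens[i].startswith("@") and len(tokens[i]) > 1
        stack = []                 # closers we are currently waiting for
        saw_block = False          # has a top-level '{' been opened?
        while i < n:
            tok = tokens[i]
            i += 1
            if stack and tok == stack[-1]:
                stack.pop()
                if saw_block and not stack:
                    break          # the top-level block closed: statement ends
                continue
            close = closer_for(tok)
            if close is not None:
                if not stack and tok == "{":
                    saw_block = True
                stack.append(close)
            elif not stack and is_at and tok == ";":
                break              # a top-level ';' ends an at-rule (rule (4))
            # Anything else, including a non-matching closer, is just a token.
        if is_at:
            kind = "at-rule"
        elif saw_block:
            kind = "ruleset"
        else:
            kind = "selector"      # trailing selector of rule (1)
        out.append((kind, tokens[start:i]))
    return out
```

For the token list <!-- a { color : red } @foo ] ( ; ) ; --> b <!-- { } c
this gives a ruleset, an at-rule that swallows the stray ']' and the ';'
inside the parentheses, a second ruleset whose selector keeps its '<!--',
and a trailing selector 'c'.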

Some notes here and there:

Note (for rule (1)) that the position of the trailing selector in this
production has normative consequences: a selector followed by nothing
(e.g. 'a EOF') can't be considered a ruleset, even if 'a' can be thought
of as the opener of the ruleset (the difference is measurable only in
CSSOM[1]). If we want 'a EOF' to make a ruleset, the grammar should be

  (1) stylesheet   : S* [ ignored S* | statement ]* EOF;
  (5) ruleset      : selector? '{' S* declaration? [ ';' S*
                     declaration? ]* [ '}' S* | EOF ]
                   | selector EOF;

instead.

Note (for rule (6)): The grammar of a declaration is dropped. Error
handling has nothing to do with the structure of a declaration. Either
the UA recognizes it as a whole or it doesn't.
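
As a rough illustration of rules (5) and (6), the following Python
sketch splits the tokens between a ruleset's '{' and its closing '}'
into declarations. It assumes tokens are plain strings with whitespace
dropped and that the caller has already found the matching '}'; the name
split_declarations is hypothetical.

```python
def split_declarations(body):
    """Split a ruleset body at ';' tokens that are not nested."""
    closers = {"(": ")", "[": "]", "{": "}"}
    decls, current, stack = [], [], []
    for tok in body:
        if not stack and tok == ";":
            if current:                 # "declaration?" may be empty
                decls.append(current)
            current = []
            continue
        current.append(tok)
        if stack and tok == stack[-1]:
            stack.pop()                 # the innermost open construct closes
        elif tok in closers:
            stack.append(closers[tok])
        elif tok.endswith("("):         # assumed FUNCTION token, e.g. "f("
            stack.append(")")
        # A stray ']' or ')' stays inside the declaration (rule (6)).
    if current:
        decls.append(current)
    return decls
```

Each returned list is then handed to the UA as a whole, which either
recognizes it or ignores it.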

Note (for rule (7)): In other words, everything that doesn't start
with an ATKEYWORD.

Note (for rule (8)): This means that the EOF rule applies to parsing of
CSS fragments whenever they have "any" in them (probably everything we
care about).
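
One way to picture the EOF rule, as a sketch only: walk the tokens with
a stack of pending closers, and whatever is still on the stack when the
input runs out is what EOF closes implicitly. The function name
implied_closers and the string-token representation are assumptions for
illustration.

```python
def implied_closers(tokens):
    """Return the closers that EOF stands in for, innermost first."""
    closers = {"(": ")", "[": "]", "{": "}"}
    stack = []
    for tok in tokens:
        if stack and tok == stack[-1]:
            stack.pop()
        elif tok in closers:
            stack.append(closers[tok])
        elif tok.endswith("("):         # assumed FUNCTION token
            stack.append(")")
        # A closer that does not match the innermost opener is ignored here.
    return stack[::-1]
```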

Note (for rule (8)): The token sequence "BAD_URI ')'" is not possible
(when we ignore COMMENT), but we include it anyway. (Similarly,
BAD_STRING X where X is not S (a newline) is not possible. Also,
BAD_COMMENT is always the last token.)


Given that this grammar is a lot harder to read than the formal grammar
for the tokenization step, I wouldn't think this is too useful. Yet the
existing core grammar is notoriously confusing. In particular, the prose
in CSS 2.1

  # Parts of style sheets that can be parsed according to this grammar
  # but not according to the grammar in Appendix G are among the parts
  # that will be ignored according to the rules for handling parsing
  # errors.

with

  # The meaning of input that cannot be tokenized or parsed is
  # undefined in CSS 2.1.

means that its examples (heck, that's all the examples under "Malformed
statements") are undefined. Moreover, a third of the test cases in this
chapter are marked invalid. All of this makes me think I should share
this grammar with the list.


Replying to some historical[2] feedback on this chapter (for those who
are interested, this is CSS2.1 ISSUE 140 and ISSUE 252):

(09/10/22 2:46), L. David Baron wrote:
> On Wednesday 2009-10-21 11:46 -0500, Dan Connolly wrote:
>> This leaves me wondering what is the role of the core syntax,
>> especially this statement:
>>
>>   "The meaning of input that cannot be tokenized or parsed is
>>    undefined in CSS 2.1."
>>
>> Is the definition of block in 4.1.6 supposed to be a re-statement
>> of the grammar or an independent definition?
>
> I think this is a bug in the core grammar.
>
> Since the core grammar serves two purposes:
>
>  (1) defining behavior of not-yet-valid CSS so that CSS can be
>  extended in the future

I think this purpose turns out not to be useful, given that css-hierarchy
wants to change this. This also makes me wonder if we ever want to use
the never-used <value> production for css-variable. Note that browsers
implement (rule (6)) "[ no-close | ']' S* | ')' S* ]" instead of <value>
and there are some differences.

>  (2) defining behavior of invalid CSS so that implementations
>  interoperate on "garbage" input

I wish this grammar had never existed. It's confusing, it doesn't match
the examples in the same spec, etc.


== Appendix - Bison ==

I've played with the grammar in Bison (code attached, adapted from
Bert's[3]); the program shows no shift/reduce conflicts. I am not sure
exactly what that means, but I suppose it implies that every input would
be unambiguously parsed into a parse tree.

On my Mac OS X machine, the code can be run with

  flex scan3.l
  bison -v -d css.y
  clang css.tab.c lex.yy.c
  cat EXAMPLE.CSS | ./a.out

For example, matching-brackets-001[4] generates the following result

[[
declaration:
color :  red

declaration:
background :  red

ruleset:
p  {
  color :  red;
  background :  red;
  }

declaration:
background :  transparent

ruleset:
#semicolon  {
  background :  transparent;
  }

at_rule:
@foo  ] } ) test-token  \
 ~  `  !  @  #  $  %  ^  &  *  -  _  +  =  |  :  >  <  ?  /  ,  .  [
\]\5D  ']'  "]" ; #  {
 background :  red ;}
]  ( \)\29  ')'  ")" ; #semicolon  {
 background :  red ;}
 } } })  '; #semicolon { background: red; } } } }' ,  "; #semicolon {
background: red; }' } } }" ;

declaration:
color :  green

ruleset:
#semicolon  {
  color :  green;
  }

declaration:
background :  transparent

ruleset:
#block  {
  background :  transparent;
  }

at_rule:
@foo  ] } ) test-token  \
 ~  `  !  @  #  $  %  ^  &  *  -  _  +  =  |  :  >  <  ?  /  ,  .  [
\]\5D  ']'  "]" ; #block  {
 background :  red ;}
]  ( \)\29  ')'  ")" ; #block  {
 background :  red ;}
)  '\'; #block { background: red; }' ,  "\"; #block { background: red; }'" {
 \}\79  '}'  "}" ; #block  {
 background :  red ;}
 #block  {
 background :  red ;}
}

declaration:
color :  green

ruleset:
#block  {
  color :  green;
  }
]]

while Bert's just fails at the very beginning.


(I spent more time getting Bison set up than actually writing down the
grammar, so if you ever get stuck with building the code, just send me
an email.)


[1] http://en.wikipedia.org/wiki/Context-free_grammar#Universality
[2] http://lists.w3.org/Archives/Public/www-style/2009Oct/0262
[3] http://lists.w3.org/Archives/Public/www-style/2010Aug/0368
[4] http://test.csswg.org/suites/css2.1/nightly-unstable/xhtml1/matching-brackets-001.xht


For those that were once deeply confused by these grammar rules (yeah,
that includes me),
Kenny

Received on Thursday, 31 May 2012 01:04:21 UTC