- From: Kang-Hao (Kenny) Lu <kennyluck@csail.mit.edu>
- Date: Thu, 31 May 2012 09:03:50 +0800
- To: WWW Style <www-style@w3.org>
- Message-ID: <4FC6C376.5080308@csail.mit.edu>
While I was trying to understand whether using <value> for CSS Variables makes sense, I realized that the CSS2.1 core grammar can be extended to a universal one, where universality of a grammar[1] means that the grammar is capable of generating all possible sequences of tokens. Here is what I've got so far:

(1) stylesheet  : S* [ ignored S* | statement ]* selector? EOF;
(2) ignored     : [ CDO | CDC ];
(3) statement   : ruleset | at-rule;

/* The following three structures are ignored as a whole if unrecognized */

(4) at-rule     : ATKEYWORD S* [ any | ']' S* | ')' S* | '}' S* | ATKEYWORD S* | CDO S* | CDC S* ]*
                  [ block | ';' S* | EOF ];
(5) ruleset     : selector? '{' S* declaration? [ ';' S* declaration? ]* [ '}' S* | EOF ];
(6) declaration : [ no-close | ']' S* | ')' S* ]+;

(7) selector    : [ any | ';' S* | ']' S* | ')' S* | '}' S* ]
                  [ any | ';' S* | ']' S* | ')' S* | '}' S* | ATKEYWORD S* | CDO S* | CDC S* ]*;
(8) any         : [ IDENT | NUMBER | PERCENTAGE | DIMENSION | STRING
                    | DELIM | URI | HASH | UNICODE-RANGE | INCLUDES
                    | DASHMATCH | ':' | BAD_STRING S
                    | [ FUNCTION | '(' | BAD_URI ] S* [ no-close | ']' S* | ';' S* | '}' S* ]* ')'
                    | '[' S* [ no-close | ')' S* | ';' S* | '}' S* ]* ']'
                  ] S*
                  | [ FUNCTION | '(' | BAD_URI ] S* [ no-close | ']' S* | ';' S* | '}' S* ]* EOF
                  | '[' S* [ no-close | ')' S* | ';' S* | '}' S* ]* EOF
                  | [ BAD_STRING | BAD_COMMENT ] EOF
                  ;
(9) block       : '{' S* [ no-close | ')' S* | ';' S* | ']' S* ]*
                  [ '}' S* | EOF ]
                  ;
(10) no-close   : any | block | ATKEYWORD S* | CDO S* | CDC S*;

What this grammar encodes is:

1. The error handling rule for malformed structure. This grammar segments a stylesheet into a list of statements. It also splits each ruleset into a list of declarations. After this is done, a UA that is aware of the semantics of a stylesheet can process the parts it recognizes and ignore the parts it doesn't.

2. The end-of-file handling rule.

3. The requirement that CDO and CDC are only ignored at the top level, not inside a selector or the thing after the ATKEYWORD.

Some notes here and there:

Note (for rule (1)) that the position of the trailing selector in this production has normative consequences: a selector followed by nothing (e.g. 'a EOF') can't be considered a ruleset, even if 'a' can be thought of as the opener of the ruleset. (The difference is measurable only in CSSOM[1].) If we want 'a EOF' to make a ruleset, the grammar should be

(1) stylesheet  : S* [ ignored S* | statement ]* EOF;
(5) ruleset     : selector? '{' S* declaration? [ ';' S* declaration? ]* [ '}' S* | EOF ]
                  | selector EOF;

instead.

Note (for rule (6)): The grammar of a declaration is dropped. Error handling has nothing to do with the structure of a declaration. Either the UA recognizes it as a whole or it doesn't.

Note (for rule (7)): In other words, everything that doesn't start with an ATKEYWORD.

Note (for rule (8)): This means that the EOF rule applies to parsing of CSS fragments whenever the fragment has "any" in it (probably everything we care about).

Note (for rule (8)): The token sequence "BAD_URI ')'" is not possible (when we ignore COMMENT), but we include it anyway. (Similarly, BAD_STRING X where X is not S (a newline) is not possible. Also, BAD_COMMENT is always the last token.)

Given that this grammar is a lot harder to read than the formal grammar for the tokenization step, I wouldn't think this is too useful. Yet the core grammar in CSS 2.1 is notoriously confusing. In particular, the prose in CSS 2.1

# Parts of style sheets that can be parsed according to this grammar
# but not according to the grammar in Appendix G are among the parts
# that will be ignored according to the rules for handling parsing
# errors.

with

# The meaning of input that cannot be tokenized or parsed is
# undefined in CSS 2.1.

means that its example (heck, that's all the examples under "Malformed statements") is undefined. Moreover, a third of the test cases in this chapter are marked invalid.
All these make me think I should share this grammar with the list.

Replying to some historical[2] feedback on this chapter (for those who are interested, this is CSS2.1 ISSUE 140 and ISSUE 252):

(09/10/22 2:46), L. David Baron wrote:
> On Wednesday 2009-10-21 11:46 -0500, Dan Connolly wrote:
>> This leaves me wondering what is the role of the core syntax,
>> especially this statement:
>>
>> "The meaning of input that cannot be tokenized or parsed is
>> undefined in CSS 2.1."
>>
>> Is the definition of block in 4.1.6 supposed to be a re-statement
>> of the grammar or an independent definition?
>
> I think this is a bug in the core grammar.
>
> Since the core grammar serves two purposes:
>
> (1) defining behavior of not-yet-valid CSS so that CSS can be
> extended in the future

I think this purpose turns out not to be useful, given that css-hierarchy wants to change this. This also makes me wonder if we ever want to use the never-used <value> production for css-variable. Note that browsers implement (rule (6)) "[ no-close | ']' S* | ')' S* ]" instead of <value>, and there are some differences.

> (2) defining behavior of invalid CSS so that implementations
> interoperate on "garbage" input

I wish this grammar had never existed. It's confusing. It doesn't match the examples in the same spec. etc. etc.

== Appendix - Bison ==

I've played with the grammar in Bison (code attached, adapted from Bert's[3]), and Bison reports no shift/reduce conflicts. I am not sure what that means, but I suppose it implies that every input would be unambiguously parsed into a parse tree.
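As a rough illustration of the statement segmentation described in item 1 above, here is a minimal Python sketch. It is not the attached Bison code and not a full implementation of rules (1)-(10): the names OPEN, skip_group and segment are made up for this example, the input is assumed to be a list of token strings with whitespace, comments, CDO and CDC already dropped, and a FUNCTION token is assumed to be written as a name followed by a separate "(".

```python
# Hypothetical sketch; all names here are illustrative only.
OPEN = {"(": ")", "[": "]", "{": "}"}


def skip_group(tokens, i):
    """tokens[i] is an opener; return the index just past its matching
    closer, or len(tokens) if EOF comes first (the end-of-file rule)."""
    close = OPEN[tokens[i]]
    i += 1
    while i < len(tokens):
        t = tokens[i]
        if t == close:
            return i + 1
        # Nested openers are matched recursively; stray closers of another
        # kind and ';' are swallowed by the enclosing group.
        i = skip_group(tokens, i) if t in OPEN else i + 1
    return i


def segment(tokens):
    """Split a token list into (kind, tokens) statements."""
    statements = []
    i = 0
    while i < len(tokens):
        start = i
        is_at = tokens[i].startswith("@")  # statement opens with an ATKEYWORD
        saw_block = False
        while i < len(tokens):
            t = tokens[i]
            if t == "{":
                # A top-level block ends both rulesets and at-rules.
                i = skip_group(tokens, i)
                saw_block = True
                break
            if is_at and t == ";":
                # Only an at-rule is ended by a top-level ';'.
                i += 1
                break
            i = skip_group(tokens, i) if t in OPEN else i + 1
        if is_at:
            kind = "at-rule"
        elif saw_block:
            kind = "ruleset"
        else:
            kind = "trailing-selector"  # the 'a EOF' case of rule (1)
        statements.append((kind, tokens[start:i]))
    return statements


sample = "p { color : red ; } @foo ] } ) ; a { b : ( } ; c }".split()
print([kind for kind, _ in segment(sample)])
# prints ['ruleset', 'at-rule', 'ruleset']
```

In the sample, the '(' inside the last block is never closed, so that ruleset runs to EOF. Every token ends up in exactly one statement, which is the informal sense in which the segmentation accepts any token sequence.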
The attached Bison code can be run with (on my Mac OS X)

  flex scan3.l
  bison -v -d css.y
  clang css.tab.c lex.yy.c
  cat EXAMPLE.CSS | ./a.out

For example, matching-brackets-001[4] generates the following result

[[
declaration: color : red
declaration: background : red
ruleset: p { color : red; background : red; }
declaration: background : transparent
ruleset: #semicolon { background : transparent; }
at_rule: @foo ] } ) test-token \ ~ ` ! @ # $ % ^ & * - _ + = | : > < ? / , . [ \]\5D ']' "]" ; # { background : red ;} ] ( \)\29 ')' ")" ; #semicolon { background : red ;} } } }) '; #semicolon { background: red; } } } }' , "; #semicolon { background: red; }' } } }" ;
declaration: color : green
ruleset: #semicolon { color : green; }
declaration: background : transparent
ruleset: #block { background : transparent; }
at_rule: @foo ] } ) test-token \ ~ ` ! @ # $ % ^ & * - _ + = | : > < ? / , . [ \]\5D ']' "]" ; #block { background : red ;} ] ( \)\29 ')' ")" ; #block { background : red ;} ) '\'; #block { background: red; }' , "\"; #block { background: red; }'" { \}\79 '}' "}" ; #block { background : red ;} #block { background : red ;} }
declaration: color : green
ruleset: #block { color : green; }
]]

while Bert's just fails at the very beginning.

(I spent more time getting Bison set up than actually writing down the grammar, so if you ever get stuck building the code, just send me an email.)

[1] http://en.wikipedia.org/wiki/Context-free_grammar#Universality
[2] http://lists.w3.org/Archives/Public/www-style/2009Oct/0262
[3] http://lists.w3.org/Archives/Public/www-style/2010Aug/0368
[4] http://test.csswg.org/suites/css2.1/nightly-unstable/xhtml1/matching-brackets-001.xht

For those that were once deeply confused by these grammar rules (yeah, that includes me),
Kenny
Received on Thursday, 31 May 2012 01:04:21 UTC