- From: Bert Bos <bert@w3.org>
- Date: Wed, 18 Aug 2010 22:29:25 +0200
- To: W3C style mailing list <www-style@w3.org>
- Message-Id: <201008182229.25960.bert@w3.org>
I had an action to write up the text for issue 129[1], which implements Zack Weinberg's modifications to the CSS tokenizer.

Background:

The goal of those modifications was to avoid the need for a lexical scanner to "back up." E.g., the 2-character input "@-" could be the start of an at-keyword, but if it isn't followed by a letter, the tokenizer has to go back and treat it as two separate DELIM tokens instead. Avoiding back-up gives a (tiny) bit of gain in speed.

In most cases, such as "@-", looking ahead at most two characters solves the issue. I.e., it is an implementation issue, not an issue with CSS. But one case, the url(...) token, is different. Something like "url((a,b))" is *not* a URI token, because of the illegal second "(". But there can be arbitrarily many characters before that parenthesis, so looking ahead a fixed number of characters doesn't work.

And so an unfortunate side effect of the modifications is that something that starts with "url(" must now be either a URI token or an error, while previously the tokenizer would back up and re-parse it as a FUNCTION. No part of CSS relied on "url()" sometimes being a FUNCTION, of course, but in theory some private extension could have used this hack.

Proposed text:

I tested the modifications with flex and they appear to work. Valid style sheets are still valid and invalid ones are still invalid (with the exception of the "url(" issue above). I propose the following changes to chapter 4[2] and appendix G[3]:

* In 4.1.1, in the table of tokens, change

      INVALID        {invalid}

  to

      BAD_STRING     {badstring}
      BAD_URI        {baduri}
      BAD_COMMENT    {badcomment}

* In the table with macros, change

      invalid        {invalid1}|{invalid2}
      invalid1       \"([^\n\r\f\\"]|\\{nl}|{escape})*
      invalid2       \'([^\n\r\f\\']|\\{nl}|{escape})*

  to

      badstring      {badstring1}|{badstring2}
      badstring1     \"([^\n\r\f\\"]|\\{nl}|{escape})*\\?
      badstring2     \'([^\n\r\f\\']|\\{nl}|{escape})*\\?
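[Editor's note: as a quick sanity check outside flex, the badstring macros above can be translated into Python regular expressions. The translation and test strings below are mine, not part of the proposal; hex escapes are lowercase-only for brevity.]

```python
import re

# CSS 2.1 macros from section 4.1.1, translated to Python regexes
NL      = r'(?:\n|\r\n|\r|\f)'
UNICODE = r'(?:\\[0-9a-f]{1,6}(?:\r\n|[ \n\r\t\f])?)'
ESCAPE  = rf'(?:{UNICODE}|\\[^\n\r\f0-9a-f])'

# badstring1:  \"([^\n\r\f\\"]|\\{nl}|{escape})*\\?
BADSTRING1 = re.compile(rf'"(?:[^\n\r\f\\"]|\\{NL}|{ESCAPE})*\\?')

# An unterminated double-quoted string matches badstring1; a properly
# closed one does not (flex's longest-match rule makes it a STRING).
print(bool(BADSTRING1.fullmatch('"no closing quote')))   # True
print(bool(BADSTRING1.fullmatch('"trailing\\')))         # True: the \\? case
print(bool(BADSTRING1.fullmatch('"closed"')))            # False: a STRING
```

The trailing `\\?` is what the proposal adds over the old invalid1/invalid2 macros: it lets a string that ends in a lone backslash at EOF still be consumed as one BAD_STRING token instead of forcing back-up.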
* Add the following macros to that same table:

      badcomment     {badcomment1}|{badcomment2}
      badcomment1    \/\*[^*]*\*+([^/*][^*]*\*+)*
      badcomment2    \/\*[^*]*(\*+[^/*][^*]*)*
      baduri         {baduri1}|{baduri2}|{baduri3}
      baduri1        url\({w}([!#$%&*-~]|{nonascii}|{escape})*{w}
      baduri2        url\({w}{string}{w}
      baduri3        url\({w}{badstring}

* In section G.2, change

      invalid1       \"([^\n\r\f\\"]|\\{nl}|{escape})*
      invalid2       \'([^\n\r\f\\']|\\{nl}|{escape})*

  to

      badstring1     \"([^\n\r\f\\"]|\\{nl}|{escape})*\\?
      badstring2     \'([^\n\r\f\\']|\\{nl}|{escape})*\\?
      badcomment1    \/\*[^*]*\*+([^/*][^*]*\*+)*
      badcomment2    \/\*[^*]*(\*+[^/*][^*]*)*
      baduri1        url\({w}([!#$%&*-~]|{nonascii}|{escape})*{w}
      baduri2        url\({w}{string}{w}
      baduri3        url\({w}{badstring}

* A few lines down, change

      invalid        {invalid1}|{invalid2}

  to

      badstring      {badstring1}|{badstring2}
      badcomment     {badcomment1}|{badcomment2}
      baduri         {baduri1}|{baduri2}|{baduri3}

* Further down again, change

      {invalid}      {return INVALID; /* unclosed string */}

  to

      {badstring}    {return BAD_STRING;}

* After

      \/\*[^*]*\*+([^/*][^*]*\*+)*\/    /* ignore comments */

  add

      {badcomment}   /* unclosed comment at EOF */

* After

      {U}{R}{L}"("{w}{string}{w}")"    {return URI;}
      {U}{R}{L}"("{w}{url}{w}")"       {return URI;}

  add

      {baduri}       {return BAD_URI;}

* Insert this new section:

      G.4 Implementation note

      This section is non-normative.

      The lexical scanner for the CSS core syntax in section 4.1.1 can be
      implemented as a scanner without back-up. In Lex notation, that
      requires the addition of the following patterns (which do not change
      the returned tokens, only the efficiency of the scanner):

      {ident}/\\           return IDENT;
      #{name}/\\           return HASH;
      @{ident}/\\          return ATKEYWORD;
      #/\\                 return DELIM;
      @/\\                 return DELIM;
      @/-                  return DELIM;
      @/-\\                return DELIM;
      -/\\                 return DELIM;
      -/-                  return DELIM;
      \</!                 return DELIM;
      \</!-                return DELIM;
      {num}{ident}/\\      return DIMENSION;
      {num}/\\             return NUMBER;
      {num}/-              return NUMBER;
      {num}/-\\            return NUMBER;
      [0-9]+/\.            return NUMBER;
      u/\+                 return IDENT;
      u\+[0-9a-f?]{1,6}/-  return UNICODE_RANGE;

[1] http://wiki.csswg.org/spec/css2.1#issue-129
[2] http://www.w3.org/TR/2009/CR-CSS2-20090908/syndata.html
[3] http://www.w3.org/TR/2009/CR-CSS2-20090908/grammar.html

Bert

PS. For people who want to experiment, I attached a flex scanner with the core tokenizer (section 4.1.1) and the added patterns (from the new section G.4 above). To test it, I also attached a bison grammar (which also implements some of the error recovery rules). Compile with

    bison -d css.y
    mv css.tab.h css.h
    flex scan3.l
    cc lex.yy.c css.tab.c -ly

Run it with

    ./a.out < some-style-sheet.css

e.g.,

    ./a.out <<< "p { color: red }"
    ./a.out <<< "p { content: url((not-valid)) }"

The program outputs the style sheet minus any errors (and formatted differently). Try "flex -F -v scan3.l" to confirm that the tokenizer indeed doesn't need backing up (but gets 6 times as big instead :-) ).

-- 
  Bert Bos                                ( W 3 C ) http://www.w3.org/
  http://www.w3.org/people/bos                               W3C/ERCIM
  bert@w3.org                             2004 Rt des Lucioles / BP 93
  +33 (0)4 92 38 76 92            06902 Sophia Antipolis Cedex, France
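[Editor's note: the "url((not-valid))" test case above can also be checked against the proposed baduri1 macro directly. The Python translation below is mine, not from the message; the {nonascii} and {escape} alternatives are omitted, which is enough for ASCII test inputs.]

```python
import re

W       = r'[ \t\r\n\f]*'
URLCHAR = r'[!#$%&*-~]'   # "(" (0x28) and ")" (0x29) are excluded

# baduri1:  url\({w}([!#$%&*-~]|{nonascii}|{escape})*{w}
BADURI1 = re.compile(rf'url\({W}{URLCHAR}*{W}')

# Because "(" is not a URL character, "url((a,b))" can never complete a
# URI token: the scanner stops at the second "(" and the BAD_URI token
# covers just "url(".
m = BADURI1.match('url((a,b))')
print(m.group(0))                                # url(

# An unterminated url(...) at EOF is also a BAD_URI, with no back-up:
print(bool(BADURI1.fullmatch('url(foo.png')))    # True
```

Note that `)` is excluded from the character class as well; the URI rule in G.2 supplies the closing parenthesis, so baduri1 deliberately stops just short of it.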
Received on Wednesday, 18 August 2010 20:29:58 UTC