- From: Bert Bos <bert@w3.org>
- Date: Wed, 18 Aug 2010 22:29:25 +0200
- To: W3C style mailing list <www-style@w3.org>
- Message-Id: <201008182229.25960.bert@w3.org>
I had an action to write up the text for issue 129[1], which implements
Zack Weinberg's modifications to the CSS tokenizer.
Background:
The goal of those modifications was to remove the need for a lexical
scanner to "back up." E.g., the 2-character input "@-" could be the
start of an at-keyword, but if it isn't followed by a letter, the
tokenizer has to go back and treat it as two separate DELIM tokens
instead. Avoiding back-up gives a (tiny) gain in speed.
In most cases, such as "@-", looking ahead at most two characters
solves the problem. I.e., it's an implementation matter, not an issue
with CSS.
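For illustration, the bounded look-ahead for the "@-" case can be
sketched as follows. This is a simplified Python sketch, not the spec's
grammar: real CSS identifiers also allow escapes and non-ASCII
characters, which are omitted here.

```python
def next_token(s, i):
    """Decide between ATKEYWORD and DELIM by peeking at most two
    characters past the '@'; the scanner never has to back up."""
    if s[i] == "@":
        j = i + 1
        if j < len(s) and s[j] == "-":      # "@-" may still start "@-foo"
            j += 1
        if j < len(s) and (s[j].isalpha() or s[j] == "_"):
            k = j + 1
            while k < len(s) and (s[k].isalnum() or s[k] in "_-"):
                k += 1
            return ("ATKEYWORD", s[i:k]), k
        return ("DELIM", "@"), i + 1        # plain '@': emit one DELIM
    raise NotImplementedError("only the '@' case is sketched here")

print(next_token("@media screen", 0)[0])    # an at-keyword
print(next_token("@- foo", 0)[0])           # just a DELIM for '@'
```

Because the decision is made from at most two characters of look-ahead,
no input ever forces the sketch to re-scan characters it has already
consumed.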
But one case, the url(...) token, is different. Something
like "url((a,b))" is *not* a URI token, because of the illegal
second "(". But there can be arbitrarily many characters before that
parenthesis, and so looking ahead a fixed number of characters
doesn't work.
And so an unfortunate side-effect of the modifications is that something
that starts with "url(" must now either be a URI token or an error,
while previously the tokenizer would back up and re-parse it as a
FUNCTION. No part of CSS relied on "url()" sometimes being a FUNCTION,
of course, but in theory some private extension could have used this
hack.
Proposed text:
I tested the modifications with flex and they appear to work. Valid
style sheets are still valid and invalid ones are still invalid (with
the exception of the "url(" issue above). I propose the following
changes to chapter 4[2] and appendix G[3]:
* In 4.1.1, in the table of tokens, change
INVALID {invalid}
to
BAD_STRING {badstring}
BAD_URI {baduri}
BAD_COMMENT {badcomment}
* In the table with macros, change
invalid {invalid1}|{invalid2}
invalid1 \"([^\n\r\f\\"]|\\{nl}|{escape})*
invalid2 \'([^\n\r\f\\']|\\{nl}|{escape})*
to
badstring {badstring1}|{badstring2}
badstring1 \"([^\n\r\f\\"]|\\{nl}|{escape})*\\?
badstring2 \'([^\n\r\f\\']|\\{nl}|{escape})*\\?
* Add the following macros to that same table:
badcomment {badcomment1}|{badcomment2}
badcomment1 \/\*[^*]*\*+([^/*][^*]*\*+)*
badcomment2 \/\*[^*]*(\*+[^/*][^*]*)*
baduri {baduri1}|{baduri2}|{baduri3}
baduri1 url\({w}([!#$%&*-~]|{nonascii}|{escape})*{w}
baduri2 url\({w}{string}{w}
baduri3 url\({w}{badstring}
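To see why these macros, once expanded, match exactly the "broken"
inputs, here is a small Python check of {baduri1}. The expansion is
simplified: escapes, strings and non-ASCII characters are left out for
brevity.

```python
import re

# Simplified expansion of the macros above; {w} is optional whitespace
# and the character class deliberately excludes '(' and ')'.
w = r"[ \t\r\n\f]*"
baduri1 = r"url\(" + w + r"[!#$%&*-~]*" + w

# The second "(" is not a URL character, so BAD_URI stops right there:
print(re.match(baduri1, "url((a,b))").group(0))

# An unterminated url(... at EOF is swallowed whole as one BAD_URI:
print(re.match(baduri1, "url(unterminated").group(0))
```

With longest-match tokenization, {baduri1} therefore produces a BAD_URI
token however many characters precede the offending parenthesis, which
is exactly what a scanner without back-up needs.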
* In section G.2, change
invalid1 \"([^\n\r\f\\"]|\\{nl}|{escape})*
invalid2 \'([^\n\r\f\\']|\\{nl}|{escape})*
to
badstring1 \"([^\n\r\f\\"]|\\{nl}|{escape})*\\?
badstring2 \'([^\n\r\f\\']|\\{nl}|{escape})*\\?
badcomment1 \/\*[^*]*\*+([^/*][^*]*\*+)*
badcomment2 \/\*[^*]*(\*+[^/*][^*]*)*
baduri1 url\({w}([!#$%&*-~]|{nonascii}|{escape})*{w}
baduri2 url\({w}{string}{w}
baduri3 url\({w}{badstring}
* A few lines down, change
invalid {invalid1}|{invalid2}
to
badstring {badstring1}|{badstring2}
badcomment {badcomment1}|{badcomment2}
baduri {baduri1}|{baduri2}|{baduri3}
* Further down again, change
{invalid} {return INVALID; /* unclosed string */}
to
{badstring} {return BAD_STRING;}
* After
\/\*[^*]*\*+([^/*][^*]*\*+)*\/ /* ignore comments */
add
{badcomment} /* unclosed comment at EOF */
* After
{U}{R}{L}"("{w}{string}{w}")" {return URI;}
{U}{R}{L}"("{w}{url}{w}")" {return URI;}
add
{baduri} {return BAD_URI;}
* Insert this new section:
G.4 Implementation note
This section is non-normative.
The lexical scanner for the CSS core syntax in section 4.1.1 can be
implemented as a scanner without back-up. In Lex notation, that
requires the addition of the following patterns (which do not change
the returned tokens, only the efficiency of the scanner):
{ident}/\\ return IDENT;
#{name}/\\ return HASH;
@{ident}/\\ return ATKEYWORD;
#/\\ return DELIM;
@/\\ return DELIM;
@/- return DELIM;
@/-\\ return DELIM;
-/\\ return DELIM;
-/- return DELIM;
\</! return DELIM;
\</!- return DELIM;
{num}{ident}/\\ return DIMENSION;
{num}/\\ return NUMBER;
{num}/- return NUMBER;
{num}/-\\ return NUMBER;
[0-9]+/\. return NUMBER;
u/\+ return IDENT;
u\+[0-9a-f?]{1,6}/- return UNICODE_RANGE;
[1] http://wiki.csswg.org/spec/css2.1#issue-129
[2] http://www.w3.org/TR/2009/CR-CSS2-20090908/syndata.html
[3] http://www.w3.org/TR/2009/CR-CSS2-20090908/grammar.html
Bert
PS. For people who want to experiment, I attached a flex scanner with
the core tokenizer (section 4.1.1) and the added patterns (from the new
section G.4 above). To test it, I also attached a bison grammar
(which also implements some of the error recovery rules). Compile with
bison -d css.y
mv css.tab.h css.h
flex scan3.l
cc lex.yy.c css.tab.c -ly
Run it with
./a.out < some-style-sheet.css
e.g.,
./a.out <<< "p { color: red }"
./a.out <<< "p { content: url((not-valid)) }"
The program outputs the style sheet minus any errors (and formatted
differently).
Try "flex -F -v scan3.l" to confirm that the tokenizer indeed doesn't
need backing up (but gets 6 times as big instead :-) ).
--
Bert Bos ( W 3 C ) http://www.w3.org/
http://www.w3.org/people/bos W3C/ERCIM
bert@w3.org 2004 Rt des Lucioles / BP 93
+33 (0)4 92 38 76 92 06902 Sophia Antipolis Cedex, France
Received on Wednesday, 18 August 2010 20:29:58 UTC