[CSS21] Backing up in tokenizer (issue 129)

I had an action to write up the text for issue 129[1], which implements 
Zack Weinberg's modifications to the CSS tokenizer.

Background:

The goal of those modifications was to avoid the need for a lexical 
scanner to "back up." E.g., the 2-character input "@-" could be the 
start of an at-keyword, but if it isn't followed by a name-start 
character (a letter, "_", a non-ASCII character or an escape), the 
tokenizer has to go back and treat it as two separate DELIM tokens 
instead. Avoiding back-up gives a (tiny) gain in speed.

In most cases, such as "@-", looking ahead at most two characters solves 
the issue. I.e., it's an implementation issue, not an issue with CSS. 
But one case, the url(...) token, is different. Something 
like "url((a,b))" is *not* a URI token, because of the illegal 
second "(". But there can be arbitrarily many characters before that 
parenthesis, and thus looking a given number of characters ahead 
doesn't work.
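
To make the difference concrete, here is a small sketch in Python 
(chosen only for brevity; it is not part of the proposed text). Both 
helper names are made up for this illustration, and the rules are 
deliberately simplified: a name may only start with an ASCII letter 
or "_", and escapes, non-ASCII characters, white space and quoted 
strings are ignored. The first function never looks more than two 
characters past the "@"; the second may have to walk arbitrarily far 
before it finds the character that rules out a URI.

```python
def starts_at_keyword(s, i=0):
    """Return True if an at-keyword starts at s[i] (which must be '@').

    Simplified name-start rule: ASCII letters and '_' only.
    """
    def nmstart(c):
        return c == "_" or (c.isascii() and c.isalpha())

    if s[i:i + 1] != "@":
        return False
    nxt = s[i + 1:i + 3]            # bounded lookahead: at most 2 chars
    if nxt[:1] == "-":
        return len(nxt) == 2 and nmstart(nxt[1])
    return len(nxt) >= 1 and nmstart(nxt[0])


def url_offending_index(s):
    """Return the index of the first character that prevents an unquoted
    'url(' from becoming a URI token, -1 if s is a complete URI, or
    len(s) if the input simply ends before the closing ')'.

    Assumes s starts with 'url('. The allowed set is the class
    [!#$%&*-~] minus the backslash, since escapes are ignored here.
    """
    allowed = set("!#$%&") | {chr(c) for c in range(ord("*"), ord("~") + 1)}
    allowed.discard("\\")
    i = 4
    while i < len(s) and s[i] in allowed:   # unbounded scan
        i += 1
    if i < len(s) and s[i] == ")":
        return -1
    return i
```

With these simplifications, the at-keyword check reads a fixed-size 
window, while url_offending_index("url(" + "a" * 1000 + "(b))") 
returns 1004: everything up to that point has been consumed before 
it is clear that the input is not a URI.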

And so an unfortunate side-effect of the modifications is that something 
that starts with "url(" must now either be a URI token or an error, 
while previously the tokenizer would back up and re-parse it as a 
FUNCTION. No part of CSS relied on "url()" sometimes being a FUNCTION, 
of course, but in theory some private extension could have used this 
hack.
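
The effect on the first token can be sketched in Python as follows. 
This is only an approximation of what flex does: the patterns are 
simplified by hand (no escapes, no non-ASCII characters, no quoted 
strings inside url()), first_token is a hypothetical helper, and ties 
between equally long matches are resolved in rule order, which is how 
flex behaves. The BAD_URI pattern is a cut-down version of the baduri1 
macro from the proposed text.

```python
import re

W = r"[ \t\r\n\f]*"
URLCHARS = r"[!#$%&*-~]*"       # unquoted URL characters, simplified

URI = re.compile(r"url\(" + W + URLCHARS + W + r"\)", re.I)
BAD_URI = re.compile(r"url\(" + W + URLCHARS + W, re.I)
FUNCTION = re.compile(r"-?[_a-z][_a-z0-9-]*\(", re.I)


def first_token(s, with_bad_uri=True):
    """Return (token name, lexeme) for the longest match at the start
    of s, or None. Earlier rules win ties, as in flex."""
    rules = [("URI", URI)]
    if with_bad_uri:
        rules.append(("BAD_URI", BAD_URI))   # proposed tokenizer
    rules.append(("FUNCTION", FUNCTION))
    best = None
    for name, rx in rules:
        m = rx.match(s)
        if m and (best is None or m.end() > len(best[1])):
            best = (name, m.group())
    return best


print(first_token("url(a.png)"))                      # ('URI', 'url(a.png)')
print(first_token("url((a,b))", with_bad_uri=False))  # ('FUNCTION', 'url(')
print(first_token("url((a,b))"))                      # ('BAD_URI', 'url(')
print(first_token("url(ab(c))"))                      # ('BAD_URI', 'url(ab')
```

Without the BAD_URI rule, the sketch falls back to FUNCTION for 
"url((a,b))", which corresponds to the old back-up behavior; with it, 
the same input starts with an error token.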

Proposed text:

I tested the modifications with flex and they appear to work. Valid 
style sheets are still valid and invalid ones are still invalid (with 
the exception of the "url(" issue above). I propose the following 
changes to chapter 4[2] and appendix G[3]:

* In 4.1.1, in the table of tokens, change

    INVALID        {invalid}
to
    BAD_STRING     {badstring}
    BAD_URI        {baduri}
    BAD_COMMENT    {badcomment}

* In the table with macros, change

    invalid        {invalid1}|{invalid2}
    invalid1       \"([^\n\r\f\\"]|\\{nl}|{escape})*
    invalid2       \'([^\n\r\f\\']|\\{nl}|{escape})*
to
    badstring      {badstring1}|{badstring2}
    badstring1     \"([^\n\r\f\\"]|\\{nl}|{escape})*\\?
    badstring2     \'([^\n\r\f\\']|\\{nl}|{escape})*\\?

* Add the following macros to that same table:

    badcomment    {badcomment1}|{badcomment2}
    badcomment1   \/\*[^*]*\*+([^/*][^*]*\*+)*
    badcomment2   \/\*[^*]*(\*+[^/*][^*]*)*
    baduri        {baduri1}|{baduri2}|{baduri3}
    baduri1       url\({w}([!#$%&*-~]|{nonascii}|{escape})*{w}
    baduri2       url\({w}{string}{w}
    baduri3       url\({w}{badstring}

* In section G.2, change

    invalid1        \"([^\n\r\f\\"]|\\{nl}|{escape})*
    invalid2        \'([^\n\r\f\\']|\\{nl}|{escape})*
to
    badstring1      \"([^\n\r\f\\"]|\\{nl}|{escape})*\\?
    badstring2      \'([^\n\r\f\\']|\\{nl}|{escape})*\\?
    badcomment1     \/\*[^*]*\*+([^/*][^*]*\*+)*
    badcomment2     \/\*[^*]*(\*+[^/*][^*]*)*
    baduri1         url\({w}([!#$%&*-~]|{nonascii}|{escape})*{w}
    baduri2         url\({w}{string}{w}
    baduri3         url\({w}{badstring}

* A few lines down, change

    invalid         {invalid1}|{invalid2}
to
    badstring       {badstring1}|{badstring2}
    badcomment      {badcomment1}|{badcomment2}
    baduri          {baduri1}|{baduri2}|{baduri3}

* Further down again, change

    {invalid}               {return INVALID; /* unclosed string */}
to
    {badstring}             {return BAD_STRING;}

* After

    \/\*[^*]*\*+([^/*][^*]*\*+)*\/          /* ignore comments */
add
    {badcomment}                         /* unclosed comment at EOF */

* After

    {U}{R}{L}"("{w}{string}{w}")"   {return URI;}
    {U}{R}{L}"("{w}{url}{w}")"      {return URI;}
add
    {baduri}                        {return BAD_URI;}

* Insert this new section:

    G.4 Implementation note

    This section is non-normative.

    The lexical scanner for the CSS core syntax in section 4.1.1 can be
    implemented as a scanner without back-up. In Lex notation, that
    requires the addition of the following patterns (which do not change
    the returned tokens, only the efficiency of the scanner):

    {ident}/\\          return IDENT;
    #{name}/\\          return HASH;
    @{ident}/\\         return ATKEYWORD;
    #/\\                return DELIM;
    @/\\                return DELIM;
    @/-                 return DELIM;
    @/-\\               return DELIM;
    -/\\                return DELIM;
    -/-                 return DELIM;
    \</!                return DELIM;
    \</!-               return DELIM;
    {num}{ident}/\\     return DIMENSION;
    {num}/\\            return NUMBER;
    {num}/-             return NUMBER;
    {num}/-\\           return NUMBER;
    [0-9]+/\.           return NUMBER;
    u/\+                return IDENT;
    u\+[0-9a-f?]{1,6}/- return UNICODE_RANGE;



[1] http://wiki.csswg.org/spec/css2.1#issue-129
[2] http://www.w3.org/TR/2009/CR-CSS2-20090908/syndata.html
[3] http://www.w3.org/TR/2009/CR-CSS2-20090908/grammar.html



Bert

PS. For people who want to experiment, I attached a flex scanner with 
the core tokenizer (section 4.1.1) and the added patterns (from the new 
section G.4 above). To test it, I also attached a bison grammar 
(that also implements some of the error recovery rules). Compile with

    bison -d css.y
    mv css.tab.h css.h
    flex scan3.l
    cc lex.yy.c css.tab.c -ly

Run it with

    ./a.out < some-style-sheet.css

e.g.,

    ./a.out <<< "p { color: red }"
    ./a.out <<< "p { content: url((not-valid)) }"

The program outputs the style sheet minus any errors (and formatted 
differently).

Try "flex -F -v scan3.l" to confirm that the tokenizer indeed doesn't 
need backing up (but gets 6 times as big instead :-) ).

-- 
  Bert Bos                                ( W 3 C ) http://www.w3.org/
  http://www.w3.org/people/bos                               W3C/ERCIM
  bert@w3.org                             2004 Rt des Lucioles / BP 93
  +33 (0)4 92 38 76 92            06902 Sophia Antipolis Cedex, France

Received on Wednesday, 18 August 2010 20:29:58 UTC