[css-syntax] Comments on the preprocessing and tokenizer from Simon Sapin on 2013-05-26 (www-style@w3.org from May 2013)

From: Simon Sapin <simon.sapin@exyr.org>
Date: Sun, 26 May 2013 11:11:07 +0800
To: www-style list <www-style@w3.org>
Message-ID: <51A17D4B.6060603@exyr.org>
In a bunch of places, the word "token" seems to have been accidentally
removed when switching to the 〈〉 notation. For example: "Emit a 〈(〉."


§2

     Each declaration […] finished with a semicolon.

→ Declarations are separated by semicolons.
This makes a difference for the last declaration of a block.
(If not applying this change, s/finished/finishes/)


     They can have CSS values following their name,
     but they end with a {}-wrapped block, similar to a rule.

s/rule/qualified rule/ ?
Same in the next sentence.


§3, §3.1

     User agents must use the parsing rules described in this
     specification to generate the CSSOM trees from text/css resources.

     The output is a CSSStyleSheet object.

Is Syntax expected to gain another section that describes how to build
a CSSOM tree? If not, remove mentions of CSSOM here.

§3.2

     The stream of Unicode code points […] will be initially seen
     by the user agent as a stream of bytes

s/will/may/
E.g. for HTML <style> elements, the CSS parser gets text nodes’ parsed
Unicode value from the HTML parser, but never sees bytes.


§4.2

This section should define "character" as a single Unicode codepoint.
(Other CSS modules such as Text may have a different definition.)


Now that "non-ASCII" starts at U+0080, "non-printable" should be changed
to stop at U+007F and not include U+0080 to U+009F.

Also, "non-printable" should include U+000B LINE TABULATION in order
to match CSS 2.1’s definition of unquoted URL token.
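
A rough sketch of the "non-printable" set with both changes applied.
The name is_non_printable is hypothetical, and the lower ranges
(U+0000 to U+0008, U+000E to U+001F) are assumed from the draft rather
than quoted in this message:

```python
def is_non_printable(c):
    """Sketch of the suggested "non-printable" definition for one character."""
    cp = ord(c)
    return (
        cp <= 0x08              # assumed lower range U+0000..U+0008
        or cp == 0x0B           # U+000B LINE TABULATION, added as suggested
        or 0x0E <= cp <= 0x1F   # assumed range U+000E..U+001F
        or cp == 0x7F           # U+007F DELETE: the range now stops here
    )
```

With this, U+0080 to U+009F fall under "non-ASCII" only.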


     Note that U+000D CARRIAGE RETURN and U+000C FORM FEED are not
     included in this definition, as they are removed from the stream
     during preprocessing.

s/removed from the stream/converted to U+000A LINE FEED/
And link "preprocessing" to the relevant section.
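
For illustration, a minimal sketch of that conversion. The name
preprocess_newlines is hypothetical, and collapsing a CR LF pair into a
single LF is an assumption, since the preprocessing section is not
quoted here:

```python
import re

# CR LF pairs are listed first so that a pair yields one LF, not two.
_NEWLINES = re.compile(r"\r\n|\r|\f")

def preprocess_newlines(text):
    """Convert CR, FF and (assumed) CR LF pairs to U+000A LINE FEED."""
    return _NEWLINES.sub("\n", text)
```

After this step the tokenizer never sees U+000D or U+000C, which is why
they need not appear in the "non-printable" definition.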


§4.3.1

     U+0023 NUMBER SIGN (#)
     If the next three input characters would start an identifier
     or would start a number, create a 〈hash〉 token

This isn’t right, as it creates hash tokens for these:
#.1 #+1 #+.1
Instead, you want "If the next input character is a name character
or the next two input characters are a valid escape, create a 〈hash〉 token"
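
A small sketch of the suggested check, not spec text. All function
names are hypothetical. "Name character" is assumed to mean an ASCII
letter, digit, underscore, hyphen, or anything at or above U+0080, and
"valid escape" is assumed to mean a backslash not followed by a newline:

```python
import string

# Assumed ASCII name characters: letters, digits, underscore, hyphen.
_NAME_ASCII = set(string.ascii_letters + string.digits + "_-")

def is_name_char(c):
    """True for a single name character (assumed definition, see above)."""
    if len(c) != 1:
        return False  # end of input
    return c in _NAME_ASCII or ord(c) >= 0x80

def is_valid_escape(c1, c2):
    """True if c1, c2 start an escape: a backslash not followed by a newline."""
    return c1 == "\\" and c2 != "\n"

def starts_hash(rest):
    """rest is the input following '#'. True if a hash token should be created."""
    return is_name_char(rest[0:1]) or is_valid_escape(rest[0:1], rest[1:2])
```

With this check, starts_hash(".1"), starts_hash("+1") and
starts_hash("+.1") are all false, while starts_hash("abc") is true.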


In the data state, U+002B PLUS SIGN (+) and U+002E FULL STOP (.)
do the same thing and could be merged.
Or do you prefer keeping codepoint order?


§4.3.7. Ident state

"〈input〉" looks like a token type but maybe shouldn’t?


§4.3.11. Number-end state

This section could be simplified by directly checking "starts with an
identifier" rather than special-casing name-start, \ and -


§4.3.18. Bad-URL state

     EOF
     This is a parse error.

This is probably not needed: whatever condition had the state machine
switch to the bad-URL state was already a parse error.


     U+005C REVERSE SOLIDUS

This doesn’t do anything. It could be removed and let the "anything 
else" clause do its job.

-- 
Simon Sapin
Received on Sunday, 26 May 2013 03:11:39 UTC