Re: [css-syntax] Comments on the preprocessing and tokenizer

On Sat, May 25, 2013 at 8:11 PM, Simon Sapin wrote:
> In a bunch of places, the word "token" seems to have been accidentally
> removed when switching to the 〈〉 notation. For example: "Emit a 〈(〉."

That was intentional, just like how we talk about "return the
<string>", not "return the <string> value".  The few places where the
word "token" is still around are mistakes.

> §2
>     Each declaration […] finished with a semicolon.
> → Declarations are separated by semicolons.
> This makes a difference for the last declaration of a block.
> (If not applying this change, s/finished/finishes/)

Accepted your change.

>     They can have CSS values following their name,
>     but they end with a {}-wrapped block, similar to a rule.
> s/rule/qualified rule/ ?
> Same in the next sentence.


> §3, §3.1
>     User agents must use the parsing rules described in this
>     specification to generate the CSSOM trees from text/css resources.
>     The output is a CSSStyleSheet object.
> Is Syntax expected to gain another section that describes how to build
> a CSSOM tree? If not remove mentions of CSSOM here.

Hm, maybe.  I should discuss this with zcorpan and see how much needs
to be defined.

> §3.2
>     The stream of Unicode code points […] will be initially seen
>     by the user agent as a stream of bytes
> s/will/may/
> E.g. for HTML <style> elements, the CSS parser gets the text nodes’ parsed
> Unicode value from the HTML parser, and never sees bytes.


> §4.2
> This section should define "character" as a single Unicode codepoint.
> (Other CSS modules such as Text may have a different definition.)
> Now that "non-ASCII" starts at U+0080, "non-printable" should be changed
> to stop at U+007F and not include U+0080 to U+009F.
> Also, "non-printable" should include U+000B LINE TABULATION in order
> to match CSS 2.1’s definition of unquoted URL token.
>     Note that U+000D CARRIAGE RETURN and U+000C FORM FEED are not
>     included in this definition, as they are removed from the stream
>     during preprocessing.
> s/removed from the stream/converted to U+000A LINE FEED/
> And link "preprocessing" to the relevant section.
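
For reference, the newline handling being described could be sketched
roughly as below. This is only an illustration: the function name is
made up, and treating a CR LF pair as a single newline is my assumption
(it matches how CSS 2.1 treats that pair), not something quoted from
the draft.

```python
def normalize_newlines(text: str) -> str:
    """Convert CR LF pairs, lone CR, and FF to a single U+000A LINE FEED."""
    # Handle the two-character pair first so it yields one LF, not two.
    return text.replace("\r\n", "\n").replace("\r", "\n").replace("\f", "\n")
```

With this, "a\r\nb\rc\fd" comes out with three plain line feeds, so the
tokenizer never sees CR or FF at all.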


> §4.3.1
>     U+0023 NUMBER SIGN (#)
>     If the next three input characters would start an identifier
>     or would start a number, create a 〈hash〉 token
> This isn’t right, as it creates hash tokens for these:
> #.1 #+1 #+.1
> Instead, you want "If the next input character is a name character
> or the next two input characters are a valid escape, create a 〈hash〉 token"

Ah, indeed.  Fixed.
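
To make the difference concrete, here is a rough sketch comparing the
two checks on the characters that follow the "#". All function names
are hypothetical, and the helper definitions (name-start character,
valid escape, "would start a number/identifier") are my simplified
reading rather than the draft's exact text.

```python
def is_digit(c: str) -> bool:
    return len(c) == 1 and "0" <= c <= "9"

def is_name_start(c: str) -> bool:
    # Assumed definition: ASCII letter, underscore, or non-ASCII (>= U+0080).
    return len(c) == 1 and (c.isascii() and c.isalpha() or c == "_" or ord(c) >= 0x80)

def is_name_char(c: str) -> bool:
    return is_name_start(c) or is_digit(c) or c == "-"

def is_valid_escape(a: str, b: str) -> bool:
    # Simplification: a backslash followed by anything except newline or EOF.
    return a == "\\" and b not in ("", "\n")

def would_start_number(a: str, b: str, c: str) -> bool:
    if a in ("+", "-"):
        return is_digit(b) or (b == "." and is_digit(c))
    if a == ".":
        return is_digit(b)
    return is_digit(a)

def would_start_identifier(a: str, b: str, c: str) -> bool:
    if a == "-":
        return is_name_start(b) or is_valid_escape(b, c)
    return is_name_start(a) or is_valid_escape(a, b)

def peek3(s: str):
    # Up to three characters; "" stands in for EOF.
    return tuple(s[i:i + 1] for i in range(3))

def old_rule(after_hash: str) -> bool:
    a, b, c = peek3(after_hash)
    return would_start_identifier(a, b, c) or would_start_number(a, b, c)

def new_rule(after_hash: str) -> bool:
    a, b, _ = peek3(after_hash)
    return is_name_char(a) or is_valid_escape(a, b)
```

Under these assumptions, old_rule accepts ".1", "+1" and "+.1" (the
three cases above), while new_rule rejects all three and still accepts
things like "fff" or an escaped name.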

> In the data state, U+002B PLUS SIGN (+) and U+002E FULL STOP (.)
> do the same thing and could be merged.
> Or do you prefer keeping codepoint order?

I keep codepoint order, at least for the data state.  It's big enough
that maintaining that order keeps it easier to read, I think.

> §4.3.7. Ident state
> "〈input〉" looks like a token type but maybe shouldn’t?

Huh, that must be an old bug.  Fixed.

> §4.3.11. Number-end state
> This section could be simplified by directly checking "starts with an
> identifier" and not special-casing name-start, \ and -


> §4.3.18. Bad-URL state
>     EOF
>     This is a parse error.
> This is probably not needed: whatever condition had the state machine
> switch to the bad-URL state was already a parse error.

You're right.  Removed.

> This doesn’t do anything. It could be removed and let the "anything else"
> clause do its job.

Nope; it prevents an escaped ) character from ending the bad-url.
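
A rough sketch of why the clause matters, with a hypothetical function
name and a simplified notion of "escape" (a backslash followed by any
character other than a newline):

```python
def consume_bad_url_remnants(text: str, pos: int) -> int:
    """Return the index just past the end of the bad URL that starts at pos."""
    while pos < len(text):
        ch = text[pos]
        if ch == ")":
            # An unescaped ")" ends the bad URL.
            return pos + 1
        if ch == "\\" and pos + 1 < len(text) and text[pos + 1] != "\n":
            # Skip the escaped character too, so "\)" does not end the bad URL.
            pos += 2
            continue
        pos += 1
    # EOF also ends the bad URL.
    return pos
```

For the input a\)b) c this stops after the second ")", at index 5.
Without the escape branch, the "anything else" clause would only skip
the backslash and the first ")" would end the bad-url at index 3.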

> Additionally, section 3.2.1. "Preprocessing the input stream" is relevant
> even when we’re parsing from Unicode text (e.g. text nodes in an HTML <style>
> element) rather than bytes, and therefore should not be under 3.2. "The
> input byte stream".

Lifted it up to 3.3.


Received on Wednesday, 29 May 2013 02:06:44 UTC