[csswg-drafts] [css-syntax] Review requested of new Parsing text (#8834)

tabatkins has just created a new issue for https://github.com/w3c/csswg-drafts:

== [css-syntax] Review requested of new Parsing text ==
Since we resolved on the new Nesting behavior (try to parse as a declaration, then parse as a rule if that's invalid), I had to do some decent rewriting of Syntax's algorithms, and went ahead and dove into a larger rewrite to clean it up in general. I've implemented the new text in [my CSS parser library](https://github.com/tabatkins/parse-css) and the (fairly limited, admittedly) tests I've run look good, but I'd appreciate a larger review.

Significant changes from the previous version:

* The algorithm structure has generally changed: rather than consuming a token and then often reconsuming it so another algorithm can deal with it, the algorithms now always rely on lookahead and don't consume tokens until they're certain to be used. This should better resemble how an actual parser works. (I haven't changed the tokenizer to this structure, but doing so is probably a good idea at some point.)
* Previously, I had "consume a list of rules" for rules+at-rules and "consume a list of declarations" for declarations+at-rules. Stylesheets and things like @media used "list of rules"; style rules and things like @font-face used "list of declarations". I've shifted *all* blocks to use the new "consume a block's contents". Since stylesheets are now the only user of "list of rules", I've renamed it to "consume a stylesheet's contents" and specialized it to always ignore the CDO/CDC tokens.
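
To illustrate the lookahead style described in the first bullet, here is a minimal sketch in Python. It is not the spec text or the parse-css implementation; `TokenStream`, `next_token`, `consume_token`, and `consume_until_semicolon` are hypothetical names, and tokens are simplified to plain strings.

```python
from dataclasses import dataclass


@dataclass
class TokenStream:
    """Toy token stream; tokens are plain strings (an assumption for brevity)."""
    tokens: list
    index: int = 0

    def next_token(self):
        """Look at the upcoming token without consuming it."""
        if self.index < len(self.tokens):
            return self.tokens[self.index]
        return "EOF"

    def consume_token(self):
        """Commit to the upcoming token and advance past it."""
        tok = self.next_token()
        if tok != "EOF":
            self.index += 1
        return tok


def consume_until_semicolon(stream):
    """Collect tokens up to (not including) the next ';'.

    The loop only peeks before deciding, so the ';' is never consumed
    and nothing has to be "reconsumed" by the caller.
    """
    collected = []
    while stream.next_token() not in (";", "EOF"):
        collected.append(stream.consume_token())
    return collected
```

Because the helper never takes a token it does not use, the caller finds the `;` still waiting as the next token and can decide what to do with it.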

------

Aside from allowing the new nesting behavior, all of these changes *should* be purely editorial, with one exception: blocks that previously contained only rules (@media, @keyframes, etc.) used "consume a list of rules", but now use the unified "consume a block's contents", which changes their error recovery in the face of semicolons.

For example, `@media { garbage; bar {...} }` would previously contain a style rule with a `garbage; bar` selector. (This is still what happens at the top level of a stylesheet.) Now the rule's selector will be just `bar`, since the `garbage;` part will get dropped as an invalid declaration. This means that rules which were accidentally invalid and dropped due to garbage might now be valid, if a semicolon separates them from the preceding garbage.
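
The difference can be sketched with a toy model in Python. This is a deliberate simplification, not the spec algorithms: tokens are plain strings, blocks are never nested, a "declaration" is just any token followed by `:`, and the function names `parse_rules_old` and `parse_block_contents_new` are made up for illustration.

```python
def parse_rules_old(tokens):
    """Old-style "list of rules": everything before '{' is the prelude,
    semicolons included."""
    rules, prelude, i = [], [], 0
    while i < len(tokens):
        if tokens[i] == "{":
            end = tokens.index("}", i)  # toy assumption: no nested blocks
            rules.append(prelude)
            prelude, i = [], end + 1
        else:
            prelude.append(tokens[i])
            i += 1
    return rules


def parse_block_contents_new(tokens):
    """New-style "block's contents": try a declaration first, then retry
    as a rule whose prelude stops at ';'. Returns only the rule preludes."""
    rules, i = [], 0
    while i < len(tokens):
        if tokens[i] == ";":
            i += 1
            continue
        # Toy declaration check (assumption): a token followed by ':'.
        # The real check also validates the declaration against a grammar.
        if i + 1 < len(tokens) and tokens[i + 1] == ":":
            while i < len(tokens) and tokens[i] != ";":
                i += 1
            continue
        # Not a declaration: retry as a rule, stopping at ';'.
        prelude = []
        while i < len(tokens) and tokens[i] not in ("{", ";"):
            prelude.append(tokens[i])
            i += 1
        if i < len(tokens) and tokens[i] == "{":
            i = tokens.index("}", i) + 1  # skip the block body
            rules.append(prelude)
        # Otherwise we hit ';' or ran out of input: the garbage is dropped.
    return rules


# Contents of `@media { garbage; bar {} }`, pre-tokenized by hand.
contents = ["garbage", ";", "bar", "{", "}"]
old_rules = parse_rules_old(contents)           # [["garbage", ";", "bar"]]
new_rules = parse_block_contents_new(contents)  # [["bar"]]
```

In the old model the semicolon is swallowed into the prelude, so the single rule has a `garbage; bar` prelude. In the new model the semicolon ends the failed item, and `bar` starts a fresh rule.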

I *suspect* this is fine, and I'd really like it to be, because it means the overall parsing behavior doesn't need to branch on grammar knowledge (and thus, whether a rule is known or unknown won't change its generic parsing). It used to be the case that parsing depended on this kind of knowledge, and it was *super* awkward to use in tooling.

But if necessary, we can hardcode some at-rules to trigger a different parsing behavior that preserves backwards compatibility more completely.

(Technically parsing *in general* depends on grammar knowledge anyway, since you need to know whether a declaration is valid in a given context to tell if you should try and redo parsing as a rule. But it turns out there's a simple and reliable rule you can use generically to get approximately the right behavior without having to know anything about grammars.)

Please view or discuss this issue at https://github.com/w3c/csswg-drafts/issues/8834 using your GitHub account


-- 
Sent via github-notify-ml as configured in https://github.com/w3c/github-notify-ml-config

Received on Friday, 12 May 2023 22:13:24 UTC