Re: [css3-syntax] Critique of Feb 15 draft

On 17/02/2013 19:08, Zack Weinberg wrote:
> §3: Typo: "described in this specifiction" -> specification.

Fixed, thanks. Although the idea of speci-fiction is fun :)


> §3.2: As a security precaution, UAs should skip steps 3 and 4 of the
> algorithm to determine the fallback encoding when the referring
> stylesheet or document is not _same-origin_ with the stylesheet (refer
> to RFC 6454 for the definition of _same-origin_).  See
> <https://www.w3.org/Bugs/Public/show_bug.cgi?id=20886>.

Sounds sensible, although I don’t know enough about this to judge. Tab?


> Issue 1 refers to steps 4 and 5 but that doesn't make sense in
> context; typo for 3 and 4?

Fixed. The algorithm was rewritten with one step fewer in 
https://dvcs.w3.org/hg/csswg/rev/550ce5a48ffb but the issue was not updated.

(I’m only fixing obvious typos as I’m not officially an editor of this 
module.)


> §3.2.1: It is unclear to me why CR and CR LF are normalized to LF at
> this stage but FF isn't.  It might actually be simpler and clearer to
> define a notion of "one vertical whitespace sequence" and refer to
> that only in the small number of tokenizer states that can consume
> vertical whitespace.

AFAICT it doesn’t make a difference, since newlines are not allowed in 
quoted strings or in backslash-escapes. But I don’t see how "vertical 
whitespace" is a better term than "newline".


> Similarly, it might be simpler and clearer to
> map U+0000 to U+FFFD only in the small number of tokenizer states that
> can *consume* U+0000.

Not sure it would be easier: as non-ASCII, U+FFFD is a "name character" 
allowed in ident and ident-like tokens.


> The last two paragraphs in this section conceptually belong to the
> "Tokenization" section.

Agreed.


> §4: The algorithm in this section might benefit from conversion to a
> recursive-descent presentation as was already done for the parser (§5).

Agreed. In particular, url tokens could reuse the definition of quoted 
strings; at-keyword and dimension could reuse that of ident; and hash 
and ident could both be based on what CSS 2.1 calls a "name".
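
To illustrate (a minimal Rust sketch, with hypothetical names and a 
simplified name-character test that ignores escapes):

    use std::iter::Peekable;
    use std::str::Chars;

    // Simplified: real name code points also include escapes.
    fn is_name_char(c: char) -> bool {
        c.is_ascii_alphanumeric() || c == '-' || c == '_' || c > '\u{7F}'
    }

    // Shared helper: ident and hash tokens would both call this,
    // and at-keyword would call it after consuming the '@'.
    fn consume_name(input: &mut Peekable<Chars<'_>>) -> String {
        let mut name = String::new();
        while let Some(&c) = input.peek() {
            if !is_name_char(c) {
                break;
            }
            name.push(c);
            input.next();
        }
        name
    }

    fn main() {
        let mut input = "foo-bar: baz".chars().peekable();
        assert_eq!(consume_name(&mut input), "foo-bar");
    }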


> It might also benefit from a revision which consistently treated the
> "current input character" as _not_ already consumed; only consume
> characters when you have committed to including them in the current
> token.  I just made an analogous change throughout Gecko's tokenizer
> (https://bugzilla.mozilla.org/show_bug.cgi?id=543151 ) and I think
> readability is improved quite a bit.

My Rust implementation has both "consume a character" and "current 
character" functions, but occasionally still needs to "go back" one 
character. Maybe "going back" could be avoided entirely with more 
look-ahead, but I’m not sure the result would be simpler.
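
A sketch of the three primitives (hypothetical names, not my actual code):

    struct Tokenizer {
        chars: Vec<char>,
        pos: usize,
    }

    impl Tokenizer {
        fn current(&self) -> Option<char> {
            self.chars.get(self.pos).copied()
        }
        fn consume(&mut self) -> Option<char> {
            let c = self.current();
            if c.is_some() {
                self.pos += 1;
            }
            c
        }
        // The occasional "go back" step. Avoiding it would mean
        // peeking further ahead before consuming, which moves the
        // complexity around rather than removing it.
        fn reconsume(&mut self) {
            self.pos = self.pos.saturating_sub(1);
        }
    }

    fn main() {
        let mut t = Tokenizer { chars: "-a".chars().collect(), pos: 0 };
        assert_eq!(t.consume(), Some('-'));
        t.reconsume(); // the '-' turned out to start an ident
        assert_eq!(t.current(), Some('-'));
    }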


> CSS 2.1 lumped colon, semicolon, comma, [, ], (, ), {, and } tokens in
> with delim tokens.  I presume that they have been pulled out for ease
> of reference in the parser?  I don't think it will affect how
> implementations do things, but it should still be mentioned in §4.7 as
> a change from the previous spec.

Do you mean Appendix G? I don’t find it very relevant anyway. It’s a 
superset of what is valid in 2.1, but if you include some Level 3 
modules it’s neither a superset nor a subset.

Section 4.1.1 (the normative one) does have tokens for these, except comma.


> If we are adjusting the tokenizer for the convenience of the parser
> anyway, it might make sense to merge CDO and CDC, which are always
> treated the same.

We recently made them "preserved" in selectors, preludes, and declaration 
values. We had a proposal for a --> combinator in selectors, presumably 
from someone who did not know that it is a specific token.


> Since we are sanctioning css3-selectors' under-the-table change to the
> generic syntax, perhaps we should go further, and generalize the class
> of MATCH tokens so that any punctuation character other than :;,(){}[]
> followed by an equals sign is a single token?  That would allow
> further addition of selector match operators without having to come
> back and change the tokenizer again.

Could work, but this proposal needs an explicit list of allowed 
characters. Staying within ASCII, and omitting :;,(){}[]"'\ to avoid 
ambiguity with other CSS constructs and &< for ease of inclusion in an 
HTML <style> element, I came up with this:

!#%+-./=>?@_`
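
The tokenizer check under this proposal would then be a simple 
membership test. A sketch (hypothetical function name, using the list 
above):

    // Would `c` followed by `next` start a single match token?
    fn starts_match_token(c: char, next: char) -> bool {
        const MATCH_CHARS: &str = "!#%+-./=>?@_`";
        next == '=' && MATCH_CHARS.contains(c)
    }

    fn main() {
        assert!(starts_match_token('!', '='));   // '!=' would be one token
        assert!(!starts_match_token(':', '='));  // ':' stays a colon token
    }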


> In CSS 2.1 percentage tokens are unacceptable in any context where an
> <integer> is required, even if their numeric value is integral
> (e.g. '200%' == 2).  If this is not intended to change, then they
> should not be described as having an integer flag.

Percentage and dimension tokens are not <integer> values, but they 
indeed have a "type" flag that is not used in any current CSS property. 
But I guess it does not hurt. Implementations are free not to have it, 
since the difference is never visible.


> §4.2: This wart can and should be dealt with by the grammar rule for
> the `transform` attribute's value.  It does not need to be in the
> generic tokenizer.

Agreed that this would make more sense, but I don’t care much either way.


> A more significant SVG wart, which *does* require a tokenizer flag,
> is: in Gecko, IE, and Opera (but not Webkit), CSS comments are
> disabled when parsing an SVG mapped attribute.  For instance:
>
>      <circle cx="10" cy="10" r="5"
>        fill="#E0B0FF /* international danger mauve */"/>
>      <circle cx="20" cy="10" r="5"
>        style="fill:#E0B0FF /* international danger mauve */"/>
>
> Webkit draws both circles in mauve.  Gecko, IE, and Opera draw only
> the second one in mauve.  (Complete test case at
> <http://www.owlfolio.org/scratchpad/danger-mauve.svg>.)  It is not
> clear to me whether this behavior is required by the SVG spec (I only
> looked at SVG 1.1), and honestly I would be happy to take the mode bit
> out of Gecko's tokenizer.
>
> (Gecko's implementation simply cuts the comment rule out of the
> tokenizer, so the slash and star are tokenized as DELIMs, which then
> trigger a syntax error in the value parser.  I can't presently think
> of a test case that would detect whether other UAs do exactly the same
> thing.)

Which is the desired behavior? Which is specified? (I can’t tell from 
the SVG spec whether the syntax of the fill attribute is supposed to be CSS.)


> §4.4.2, 4.4.3 (double- and single-quote string state): EOF in these
> contexts is *not* a parse error in CSS 2.1.

The tokenizer in section 4.1.1 of CSS 2.1 makes these bad-string tokens, 
which the Core Grammar in the same section does not allow in property 
values. This is, by the way, inconsistent with the "Unexpected end of 
style sheet" rule in section 4.2 of CSS 2.1. All browsers seem to do the 
latter. Test case:

data:text/html,<style>body:before{content:'pass


> […]
>
> (In general I would prefer that section 4 not speak of "parse errors";
> the tokenizer should be conceptualized as not having any error
> conditions at all.

This is already the case. Note that "parse errors" are not fatal, but 
only used in validators or for logging in developer tools. They are 
independent of whether a rule or declaration is ignored as invalid.


> All possible streams of input characters produce a
> well-defined stream of output tokens; it is the *parser* that always
> treats certain output tokens (bad-string and bad-url) as parse
> errors.)

Note that bad-string and bad-url are now "preserved" in preludes and 
declaration values. It’s up to higher-level parsers (Selectors, a given 
property value, …) to reject them. Selectors and Media Queries, for 
example, have different error handling.


> §4.4.4 (hash state): css3-selectors requires that ID selectors be
> limited to the subset of 'hash' tokens in which a valid CSS identifier
> follows the '#'.  Gecko implements this by making an internal
> distinction between 'hash' and 'id' tokens.  This might be worth
> adding to the official syntax -- perhaps only as a 'this hash token
> meets this requirement' flag bit.

Oh, this is bad. §6.5 of Selectors 3 mentions # followed by an ident, 
which is more restrictive than the grammar in §10.1 of the same spec. 
I’m not even sure the restriction is intentional.

Test case (%231 is #1, URL-encoded):

data:text/html,<style>%231{}</style><script>document.write(document.styleSheets[0].cssRules[0].selectorText)</script>

Gecko and WebKit show nothing (the selector is invalid). Presto shows 
#\31, where \31 is the hex escape for 1.
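
If the restriction stays, the flag bit Zack suggests could look like 
this sketch (hypothetical names; the identifier test is simplified and 
ignores escapes and leading hyphens):

    // A hash token that records whether the part after '#' is a
    // valid CSS identifier.
    struct HashToken {
        value: String,
        is_identifier: bool,
    }

    fn hash_token(value: String) -> HashToken {
        let is_identifier = value
            .chars()
            .next()
            .map_or(false, |c| !c.is_ascii_digit());
        HashToken { value, is_identifier }
    }

    fn main() {
        assert!(hash_token("foo".to_string()).is_identifier);  // '#foo': valid ID selector
        assert!(!hash_token("1".to_string()).is_identifier);   // '#1': hash only
    }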


> §4.4.9 (ident state): If Gecko encounters the unusual sequence of
> characters "- \ EOF" or "- \ newline" (spaces for clarity only), the
> initial - is emitted as an ident token rather than a delim token.
> This is a bug according to both this spec and CSS 2.1, but can anyone
> think of a test case that would distinguish this behavior from the
> 'correct' behavior?

I tried this, but it doesn’t work: delim(\) makes the declaration 
invalid regardless of the bug you describe.

data:text/html,<style>body:before{content: counter(\-); 
counter-increment: -\


> §4.4.12 (number state): This state appears to contain unnecessary
> sanity checks, since the data state will not transition *into* the
> number state without first checking that the initial sequence of
> characters forms a valid number.

Agreed that some of it is unnecessary. But just removing the "anything 
else" part would be weird if you don’t have all the context. I suggest 
"This case should never happen. This is a programming error."
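
In a hand-written tokenizer that wording maps naturally onto an 
assertion. A sketch:

    // The data state transitions to the number state only after
    // checking that a valid number starts here, so "anything else"
    // is a programming error, not an input error.
    fn number_state(first: char) {
        match first {
            '0'..='9' | '+' | '-' | '.' => {
                // ... consume the rest of the number ...
            }
            _ => unreachable!("number state entered without a valid number"),
        }
    }

    fn main() {
        number_state('4'); // fine; number_state('x') would panic
    }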


> §4.4.13, §4.4.14 (number-rest, number-fraction): The numeric value of
> a percentage token should be "the number produced by interpreting the
> number token's representation as a base-10 number" *divided by 100*.
> (It might be better to state this in css3-values "Percentage" section
> instead, but that's already in CR and doesn't say anything of the sort
> right now.)

We can still make clarifications in CRs, but I think it’s pretty clear 
what a percentage means. In my mental model, though, the division by 100 
only occurs when resolving a percentage relative to some other value. 
The percentage’s own "value" goes up to 100, not 1.
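
A sketch of that model (hypothetical names): the token keeps the value 
as written, and the division only happens at resolution time:

    struct Percentage {
        value: f64, // as written: 200.0 for '200%', not 2.0
    }

    fn resolve(p: &Percentage, reference: f64) -> f64 {
        p.value / 100.0 * reference
    }

    fn main() {
        let p = Percentage { value: 200.0 };  // '200%'
        assert_eq!(resolve(&p, 50.0), 100.0); // 200% of 50 is 100
    }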


> §4.4.16 (sci-notation state): BUG: needs to allow for dimension and
> percentage tokens to contain scientific notation.  SVG 1.1 does in
> fact require this (section 4.2 -- see how <length> and <percentage>
> are defined in terms of <number>).

This was apparently deliberate:

http://lists.w3.org/Archives/Public/www-style/2012Aug/0900.html
http://lists.w3.org/Archives/Public/www-style/2013Jan/0303.html

But maybe we need to change it for compat with SVG.


> §4.4.17 through 4.4.21 (the URL states): As with strings, EOF in any
> of these states is not a parse error -- and produces a url token, not
> a bad-url token -- in CSS 2.1.  Again, I have no objection to the
> change as long as we are confident web compat is not affected.

My reading of CSS 2.1 (both chapter 4 and Appendix G) is that it does 
produce a bad-url token.


> §4.4.22 (bad URL state): It does simplify things to make this not
> balance brackets and braces, but has anyone working on a non-Webkit
> UA expressed an opinion on this change?

I don’t recall much discussion about this, but maybe I missed it.


> §4.7 What exactly does it mean to say "the comma token has been
> added"?  It was a DELIM in 2.1 but so were :;{}[]().

As above, not in chapter 4.


> §5 Two high-level points.
>
> First, white space's significance in selectors is a design error which
> should not be allowed to creep into other contexts.  It would be good,
> therefore, to exclude whitespace tokens from appearing in any list
> other than the prelude of a qualified rule.  This would also simplify
> many of the algorithms in this section, which would no longer need to
> mention that they discard whitespace.

In addition to dbaron’s point, we might want to use selectors in 
property values in the future.


> Second, why can't the generic grammar parse all at-rules as
>
> at-rule : AT-KEYWORD component-value* ( ';' | {}-block )
>
> and leave the rest to specific parsers for each at-rule?  That would
> eliminate the need for the "recognized at-rule list" and the
> distinction between rule-filled and declaration-filled at-rules.

I like this, and it actually became much more doable when we recently 
added an independent "Parse a list of declarations" entry point. We 
would also need "Parse a list of rules" but that’s easy.

Each at-rule would need to define that its value/body is parsed by 
calling one of these (or even something else). But they already do 
something similar for the prelude anyway.
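
Roughly like this sketch (hypothetical types; the parse_* functions 
stand in for the spec’s entry points):

    enum ComponentValue { /* preserved tokens, blocks, functions, ... */ }
    struct Declaration;
    struct Rule;

    // The generic parser stores the body as raw component values.
    struct AtRule {
        name: String,
        prelude: Vec<ComponentValue>,
        block: Option<Vec<ComponentValue>>, // None when ended by ';'
    }

    // Stand-ins for "Parse a list of declarations" and the proposed
    // "Parse a list of rules" entry points.
    fn parse_list_of_declarations(_input: &[ComponentValue]) -> Vec<Declaration> { vec![] }
    fn parse_list_of_rules(_input: &[ComponentValue]) -> Vec<Rule> { vec![] }

    // Each at-rule's own spec decides how its body is interpreted:
    fn interpret(rule: &AtRule) {
        match (rule.name.as_str(), &rule.block) {
            ("media", Some(block)) => { let _rules = parse_list_of_rules(block); }
            ("page", Some(block)) => { let _decls = parse_list_of_declarations(block); }
            _ => { /* unknown at-rules keep their raw component values */ }
        }
    }

    fn main() {
        let rule = AtRule {
            name: "media".to_string(),
            prelude: vec![],
            block: Some(vec![]),
        };
        interpret(&rule);
    }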


> If a
> trailing {}-block is too generic, we could instead define
>
> generic-list : declaration ( ';' generic-list )?
>               | qualified-rule generic-list?
>               | at-rule generic-list?
>               | /*empty*/
>
> at-rule: AT-KEYWORD component-value* ( ';' | '{' generic-list '}' )
>
> but I think I'd prefer keeping the generic at-rule grammar as
> open-ended as possible.

Mixing declarations and qualified rules is bad: you need unbounded 
look-ahead to tell which of the two you’re parsing. For example, 
"a:hover b" could begin either a declaration (property "a") or a 
selector, and you can’t tell which until you reach a ';' or a '{'.


> Also, I have not carefully checked over the error recovery behavior to
> make sure it is still consistent with CSS 2.1's explicit "rules for
> recovering from parse errors."  It might be a good idea to add back a
> similar explicit description of error recovery, if only so that future
> modules don't screw it up.

I would be in favor of having non-normative prose describing error 
recovery, just like the railroad diagrams describe valid syntax.


> §5.2 (quirks mode): The quirks defined in this spec should be thought
> of as augmenting the set of acceptable values for particular property
> declarations; they do not affect the generic grammar's determination
> of where one declaration ends and another begins.  Therefore, they
> belong with the definitions of <length> and <color>, or else with the
> definitions of the specific declarations to which they apply; they
> should not appear in this spec at all.

Same as the "transform function whitespace" flag. This would make more 
sense but I personally don’t care much.


> §5.4, issue 2: Gecko has several entry points other than the ones
> listed, but they are all for specific constructs: media list, color,
> selector, keyframe rule, keyframe selector, @supports declaration,
> @supports condition.  Therefore it seems inappropriate to me to
> mention them in this spec.  (Maybe there should be something about
> other CSS modules adding other entry points.)

My understanding is that all of these belong in other modules. They 
would themselves use one or more of Syntax’s entry points and work on 
rules, declarations, and component values, not tokens.


> §5.4.6 (comma-separated list of component values): "Append _temp_ to
> _val_" is ambiguous.  Suppose _val_ is empty and _temp_ is [a b c].
> Does appending _temp_ to _val_ set _val_ to [a b c] or to [ [a b c] ]?
> I believe that the latter is necessary for correct processing of
> e.g. the 'background' shorthand, but I can't tell whether this is
> intended to be used for that (since this algorithm is not actually
> used in this spec).

The latter. It’s "append", not "set". But I agree that this entry point 
is not terribly useful. ("Parse a list of component values" could be 
used instead.)


> §5.4.7 (an+b notation) I think this, too, belongs in the specific
> grammar where it is used, not here.  (I note that it is unreferenced
> by any other part of this specification.)

It could just as well be in Selectors 4, once that is switched to being 
based on Syntax 3.


> Also, above some very low complexity threshold, I find the verbosity
> and lack of explicit structure in HTML5-style prose algorithms to be a
> serious obstacle to understanding.  In the rest of the spec this is
> not a problem, but I would have to rewrite this section in traditional
> pseudocode in order to figure out whether it was correct.

I don’t know of any other style that allows specifying all of the error 
handling without being much more complex. For example, a "total" grammar 
for CSS (one where every input matches) would be unreadable. The 
railroad diagrams are meant to help with the high-level understanding. 
I could write one up for an+b.


> §6 (Serialization): If my suggestion of not explicitly including
> whitespace tokens in lists other than qualified rule preludes is
> taken, then this should be changed to insert whitespace rather than an
> empty comment if the list being serialized is not a QR-prelude list.

I don’t like serialization being context-sensitive. (Although this is 
not a strong opinion.)
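
To illustrate the trade-off (a sketch; serialize_idents is a 
hypothetical helper): two adjacent ident tokens would merge if 
concatenated directly, so the serializer has to separate them somehow.

    // "a" + "b" would re-tokenize as the single ident "ab". An empty
    // comment is safe in every context, which keeps serialization
    // context-free; Zack's suggestion would use a space instead,
    // except in qualified-rule preludes.
    fn serialize_idents(idents: &[&str]) -> String {
        idents.join("/**/")
    }

    fn main() {
        assert_eq!(serialize_idents(&["a", "b"]), "a/**/b");
    }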


Thanks for the review!
-- 
Simon Sapin
