- From: Zack Weinberg <zackw@panix.com>
- Date: Sun, 17 Feb 2013 13:08:55 -0500
- To: www-style@w3.org
I haven't been following the discussion up till now, so I apologize if this hits anything that was already discussed and resolved. Most of my comments are fairly minor, but there are two things I would describe as outright bugs: the missing SVG wart (see below) and the lack of support for scientific notation in DIMENSION and PERCENTAGE. I went over the tokenizer a lot more carefully than the parser, just because I've been head-down in Gecko's tokenizer recently and so that's what's crisp in my mind.

§3: Typo: "described in this specifiction" -> specification.

§3.2: As a security precaution, UAs should skip steps 3 and 4 of the algorithm to determine the fallback encoding when the referring stylesheet or document is not _same-origin_ with the stylesheet (refer to RFC 6454 for the definition of _same-origin_). See <https://www.w3.org/Bugs/Public/show_bug.cgi?id=20886>. Issue 1 refers to steps 4 and 5, but that doesn't make sense in context; typo for 3 and 4?

§3.2.1: It is unclear to me why CR and CR LF are normalized to LF at this stage but FF isn't. It might actually be simpler and clearer to define a notion of "one vertical whitespace sequence" and refer to that only in the small number of tokenizer states that can consume vertical whitespace. Similarly, it might be simpler and clearer to map U+0000 to U+FFFD only in the small number of tokenizer states that can *consume* U+0000.

The last two paragraphs in this section conceptually belong to the "Tokenization" section.

§4: The algorithm in this section might benefit from conversion to a recursive-descent presentation, as was already done for the parser (§5). It might also benefit from a revision that consistently treats the "current input character" as _not_ already consumed: only consume characters when you have committed to including them in the current token. I just made an analogous change throughout Gecko's tokenizer (https://bugzilla.mozilla.org/show_bug.cgi?id=543151) and I think readability is improved quite a bit.

CSS 2.1 lumped colon, semicolon, comma, [, ], (, ), {, and } tokens in with delim tokens. I presume they have been pulled out for ease of reference in the parser? I don't think it will affect how implementations do things, but it should still be mentioned in §4.7 as a change from the previous spec.

If we are adjusting the tokenizer for the convenience of the parser anyway, it might make sense to merge CDO and CDC, which are always treated the same.

Since we are sanctioning css3-selectors' under-the-table change to the generic syntax, perhaps we should go further and generalize the class of MATCH tokens, so that any punctuation character other than :;,(){}[] followed by an equals sign is a single token? That would allow further addition of selector match operators without having to come back and change the tokenizer again. (The major reason I can think of *not* to do this is that it might cause trouble for future extensions to calc(). But C does fine with a bunch of x= operators, so I doubt it will be a problem in practice.)

In CSS 2.1, percentage tokens are unacceptable in any context where an <integer> is required, even if their numeric value is integral (e.g. '200%' == 2). If this is not intended to change, they should not be described as having an integer flag.

§4.2: This wart can and should be dealt with by the grammar rule for the `transform` attribute's value. It does not need to be in the generic tokenizer.
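For concreteness, here is a sketch of how an attribute-specific grammar could absorb the quirk. This is hypothetical Python over made-up (type, value) token tuples, not anything from the spec: a stock CSS tokenizer hands us 'translate(' as a single FUNCTION token but 'translate (' as IDENT, whitespace, '(', and the transform grammar can simply accept both shapes.

    def transform_function_head(tokens, i):
        """Match the head of an SVG transform function at index i.
        Returns (name, index past the open paren) or None."""
        t = tokens[i]
        if t[0] == "function":          # 'translate(' -- the normal CSS shape
            return (t[1], i + 1)
        if t[0] == "ident":             # 'translate' ws* '(' -- the SVG quirk
            j = i + 1
            while j < len(tokens) and tokens[j][0] == "whitespace":
                j += 1                  # swallow the SVG-only whitespace here
            if j < len(tokens) and tokens[j][0] == "(":
                return (t[1], j + 1)
        return None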
A more significant SVG wart, one which *does* require a tokenizer flag, is this: in Gecko, IE, and Opera (but not Webkit), CSS comments are disabled when parsing an SVG mapped attribute. For instance:

    <circle cx="10" cy="10" r="5"
            fill="#E0B0FF /* international danger mauve */"/>
    <circle cx="20" cy="10" r="5"
            style="fill:#E0B0FF /* international danger mauve */"/>

Webkit draws both circles in mauve. Gecko, IE, and Opera draw only the second one in mauve. (Complete test case at <http://www.owlfolio.org/scratchpad/danger-mauve.svg>.) It is not clear to me whether this behavior is required by the SVG spec (I only looked at SVG 1.1), and honestly I would be happy to take the mode bit out of Gecko's tokenizer. (Gecko's implementation simply cuts the comment rule out of the tokenizer, so the slash and star are tokenized as DELIMs, which then trigger a syntax error in the value parser. I can't presently think of a test case that would detect whether other UAs do exactly the same thing.)

§4.4.2, 4.4.3 (double- and single-quote string state): EOF in these contexts is *not* a parse error in CSS 2.1. I have never liked the "end of file closes all open constructs without error" rule in CSS 2.1, and have no objection to EOF being made into a parse error inside a string, as long as we are confident there is no web-compat issue, particularly if we're scrapping the "end of file closes all open constructs without error" rule in its *entirety*. (It appears from §5 that we aren't.) However, in that case the tokenizer should emit a bad-string instead of a string when it encounters EOF in these states.

(In general I would prefer that section 4 not speak of "parse errors"; the tokenizer should be conceptualized as not having any error conditions at all. All possible streams of input characters produce a well-defined stream of output tokens; it is the *parser* that always treats certain output tokens (bad-string and bad-url) as parse errors.)

§4.4.4 (hash state): css3-selectors requires that ID selectors be limited to the subset of 'hash' tokens in which a valid CSS identifier follows the '#'. Gecko implements this by making an internal distinction between 'hash' and 'id' tokens. This might be worth adding to the official syntax -- perhaps only as a 'this hash token meets this requirement' flag bit.

§4.4.9 (ident state): If Gecko encounters the unusual sequence of characters "- \ EOF" or "- \ newline" (spaces for clarity only), the initial - is emitted as an ident token rather than a delim token. This is a bug according to both this spec and CSS 2.1, but can anyone think of a test case that would distinguish this behavior from the 'correct' behavior?

§4.4.10, 4.4.11 (transform function whitespace): see my comments on §4.2 above.

§4.4.12 (number state): This state appears to contain unnecessary sanity checks, since the data state will not transition *into* the number state without first checking that the initial sequence of characters forms a valid number.

§4.4.13, 4.4.14 (number-rest, number-fraction): The numeric value of a percentage token should be "the number produced by interpreting the number token's representation as a base-10 number" *divided by 100*. (It might be better to state this in the css3-values "Percentage" section instead, but that's already in CR and doesn't say anything of the sort right now.)

§4.4.16 (sci-notation state): BUG: needs to allow for dimension and percentage tokens to contain scientific notation. SVG 1.1 does in fact require this (see its section 4.2, where <length> and <percentage> are defined in terms of <number>).
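For concreteness, here is roughly what I have in mind for numeric tokens, as a hypothetical Python sketch (the NUMBER and UNIT patterns are crude stand-ins for the real scanners, and all names are mine): scientific notation is accepted uniformly in number, percentage, and dimension tokens, and a percentage's numeric value is the written number divided by 100.

    import re

    NUMBER = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?")
    UNIT = re.compile(r"[A-Za-z]+")   # real ident scanning is more involved

    def consume_numeric(src, i):
        """Scan a numeric token starting at src[i]; returns
        (type, value, unit, next_index) or None."""
        m = NUMBER.match(src, i)
        if not m:
            return None
        value, i = float(m.group()), m.end()
        if i < len(src) and src[i] == "%":
            return ("percentage", value / 100, "%", i + 1)   # '200%' -> 2.0
        u = UNIT.match(src, i)
        if u:
            return ("dimension", value, u.group(), u.end())  # '6.02e23px' ok
        return ("number", value, None, i)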
§4.4.17 through 4.4.21 (the URL states): As with strings, EOF in any of these states is not a parse error in CSS 2.1 -- and produces a url token, not a bad-url token. Again, I have no objection to the change as long as we are confident web compat is not affected.

§4.4.22 (bad URL state): It does simplify things to make this not balance brackets and braces, but has anyone working on a non-Webkit UA expressed an opinion on this change?

§4.7: What exactly does it mean to say "the comma token has been added"? It was a DELIM in 2.1, but so were :;{}[]().

§5: Two high-level points. First, white space's significance in selectors is a design error that should not be allowed to creep into other contexts. It would be good, therefore, to exclude whitespace tokens from appearing in any list other than the prelude of a qualified rule. This would also simplify many of the algorithms in this section, which would no longer need to mention that they discard whitespace.

Second, why can't the generic grammar parse all at-rules as

    at-rule : AT-KEYWORD component-value* ( ';' | {}-block )

and leave the rest to specific parsers for each at-rule? That would eliminate the need for the "recognized at-rule list" and the distinction between rule-filled and declaration-filled at-rules. If a trailing {}-block is too generic, we could instead define

    generic-list : declaration ( ';' generic-list )?
                 | qualified-rule generic-list?
                 | at-rule generic-list?
                 | /*empty*/

    at-rule : AT-KEYWORD component-value* ( ';' | '{' generic-list '}' )

but I think I'd prefer keeping the generic at-rule grammar as open-ended as possible.

Also, I have not carefully checked the error recovery behavior to make sure it is still consistent with CSS 2.1's explicit "rules for recovering from parse errors." It might be a good idea to add back a similarly explicit description of error recovery, if only so that future modules don't screw it up.

§5.2 (quirks mode): The quirks defined in this spec should be thought of as augmenting the set of acceptable values for particular property declarations; they do not affect the generic grammar's determination of where one declaration ends and another begins. Therefore, they belong with the definitions of <length> and <color>, or else with the definitions of the specific declarations to which they apply; they should not appear in this spec at all.

§5.4, issue 2: Gecko has several entry points other than the ones listed, but they are all for specific constructs: media list, color, selector, keyframe rule, keyframe selector, @supports declaration, @supports condition. Therefore it seems inappropriate to me to mention them in this spec. (Maybe there should be something about other CSS modules adding other entry points.) (For the curious looking at the code, the difference between nsCSSParser::ParseStyleAttribute and ::ParseDeclarations is only the argument list, not the grammar, AFAICT.)

§5.4.6 (comma-separated list of component values): "Append _temp_ to _val_" is ambiguous. Suppose _val_ is empty and _temp_ is [a b c]. Does appending _temp_ to _val_ set _val_ to [a b c] or to [ [a b c] ]? I believe the latter is necessary for correct processing of e.g. the 'background' shorthand, but I can't tell whether this algorithm is intended to be used for that (since it is not actually used anywhere in this spec).
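In Python terms, the ambiguity is append versus extend, and as noted above I believe the spec needs the first (nested) reading:

    temp = ["a", "b", "c"]

    val = []
    val.append(temp)   # nested reading:    val == [["a", "b", "c"]]

    val = []
    val.extend(temp)   # flattened reading: val == ["a", "b", "c"]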
§5.4.7 (an+b notation): I think this, too, belongs in the specific grammar where it is used, not here. (I note that it is unreferenced by any other part of this specification.) Also, above some very low complexity threshold, I find the verbosity and lack of explicit structure in HTML5-style prose algorithms a serious obstacle to understanding. In the rest of the spec this is not a problem, but I would have to rewrite this section in traditional pseudocode in order to figure out whether it was correct. (See the postscript below for the sort of pseudocode I mean.)

§5.5.2, 5.5.3 (at-statement, at-rule): Per my recommendation above, these should be collapsed together.

§5.5.8, 5.5.9 (quirks): Per my recommendation above, these should be removed.

§5.5.11 (functions): Per my recommendation above, the language about quirks and 'rect' should be removed.

§6 (Serialization): If my suggestion of not explicitly including whitespace tokens in lists other than qualified-rule preludes is taken, then this section should be changed to insert whitespace, rather than an empty comment, when the list being serialized is not a QR-prelude list.

zw
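P.S. To illustrate what I mean by traditional pseudocode: here is the an+b microsyntax as I understand it, sketched in Python over the string form rather than the token stream (so it glosses over token-level subtleties, e.g. how '+3n' or '3n-2' tokenize). The structure and names are mine, not the spec's, and I make no claim this matches the prose algorithm exactly.

    import re

    ANB = re.compile(r"([+-]?)(\d*)n(?:\s*([+-])\s*(\d+))?")

    def parse_anb(text):
        """Parse an an+b expression; returns (a, b) or None on failure."""
        s = text.strip().lower()
        if s == "odd":
            return (2, 1)
        if s == "even":
            return (2, 0)
        if re.fullmatch(r"[+-]?\d+", s):   # a bare <integer> supplies only b
            return (0, int(s))
        m = ANB.fullmatch(s)               # [sign] [digits] n [sign digits]
        if not m:
            return None
        sign_a, digits_a, sign_b, digits_b = m.groups()
        a = int(digits_a) if digits_a else 1   # 'n' and '-n' mean a == +/-1
        if sign_a == "-":
            a = -a
        b = -int(digits_b) if sign_b == "-" else int(digits_b or 0)
        return (a, b)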