[css3-syntax] Critique of Feb 15 draft

I haven't been following discussion up till now, so I apologize if
this hits something that was already discussed and resolved.  Most of
my comments are fairly minor, but there are two things which I would
describe as outright bugs: the missing SVG wart and the lack of
support for scientific notation in DIMENSION and PERCENTAGE.

I went over the tokenizer a lot more carefully than the parser, just
because I've been head down in Gecko's tokenizer recently and so
that's what's crisp in my mind.

§3: Typo: "described in this specifiction" -> specification.

§3.2: As a security precaution, UAs should skip steps 3 and 4 of the
algorithm to determine the fallback encoding when the referring
stylesheet or document is not _same-origin_ with the stylesheet (refer
to RFC 6454 for the definition of _same-origin_).  See
<https://www.w3.org/Bugs/Public/show_bug.cgi?id=20886>.

Issue 1 refers to steps 4 and 5 but that doesn't make sense in
context; typo for 3 and 4?

§3.2.1: It is unclear to me why CR and CR LF are normalized to LF at
this stage but FF isn't.  It might actually be simpler and clearer to
define a notion of "one vertical whitespace sequence" and refer to
that only in the small number of tokenizer states that can consume
vertical whitespace.  Similarly, it might be simpler and clearer to
map U+0000 to U+FFFD only in the small number of tokenizer states that
can *consume* U+0000.
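
To make the suggestion concrete, here is a minimal sketch (Python, with
a hypothetical helper name) of consuming "one vertical whitespace
sequence" at the point of use, instead of normalizing up front:

```python
def consume_vertical_whitespace(text, i):
    """Hypothetical helper: consume one vertical whitespace sequence
    (CR LF, CR, LF, or FF) starting at text[i].  Returns the index
    just past the sequence, or i unchanged if none is present, so no
    up-front CR / CR LF normalization pass over the input is needed."""
    if text.startswith("\r\n", i):
        return i + 2
    if i < len(text) and text[i] in "\r\n\f":
        return i + 1
    return i
```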

The last two paragraphs in this section conceptually belong to the
"Tokenization" section.

§4: The algorithm in this section might benefit from conversion to a
recursive-descent presentation as was already done for the parser (§5).

It might also benefit from a revision which consistently treated the
"current input character" as _not_ already consumed; only consume
characters when you have committed to including them in the current
token.  I just made an analogous change throughout Gecko's tokenizer
(https://bugzilla.mozilla.org/show_bug.cgi?id=543151) and I think
readability is improved quite a bit.

CSS 2.1 lumped colon, semicolon, comma, [, ], (, ), {, and } tokens in
with delim tokens.  I presume that they have been pulled out for ease
of reference in the parser?  I don't think it will affect how
implementations do things, but it should still be mentioned in §4.7 as
a change from the previous spec.

If we are adjusting the tokenizer for the convenience of the parser
anyway, it might make sense to merge CDO and CDC, which are always
treated the same.

Since we are sanctioning css3-selectors' under-the-table change to the
generic syntax, perhaps we should go further, and generalize the class
of MATCH tokens so that any punctuation character other than :;,(){}[]
followed by an equals sign is a single token?  That would allow
further addition of selector match operators without having to come
back and change the tokenizer again.  (The major reason I can think of
*not* to do this is that it might cause trouble for future extensions
to calc().  But C does fine with a bunch of x= operators, so I doubt
it will be a problem in practice.)
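
For concreteness, a sketch of the generalized rule; the exact
punctuation class is my guess and would need to be pinned down in
spec text:

```python
def is_generalized_match_token(s):
    """Hypothetical generalized MATCH rule: any punctuation character
    other than :;,(){}[] (and quotes/backslash), immediately followed
    by an equals sign, forms a single two-character token."""
    punct = "!#$%&*+-./<=>?@^_|~"
    return len(s) == 2 and s[1] == "=" and s[0] in punct
```

Under this rule a future '?=' selector operator, say, would tokenize
correctly with no further tokenizer changes.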

In CSS 2.1 percentage tokens are unacceptable in any context where an
<integer> is required, even if their numeric value is integral
(e.g. '200%' == 2).  If this is not intended to change, then they
should not be described as having an integer flag.

§4.2: This wart can and should be dealt with by the grammar rule for
the `transform` attribute's value.  It does not need to be in the
generic tokenizer.

A more significant SVG wart, which *does* require a tokenizer flag,
is: in Gecko, IE, and Opera (but not Webkit), CSS comments are
disabled when parsing an SVG mapped attribute.  For instance:

    <circle cx="10" cy="10" r="5"
      fill="#E0B0FF /* international danger mauve */"/>
    <circle cx="20" cy="10" r="5"
      style="fill:#E0B0FF /* international danger mauve */"/>

Webkit draws both circles in mauve.  Gecko, IE, and Opera draw only
the second one in mauve.  (Complete test case at
<http://www.owlfolio.org/scratchpad/danger-mauve.svg>.)  It is not
clear to me whether this behavior is required by the SVG spec (I only
looked at SVG 1.1), and honestly I would be happy to take the mode bit
out of Gecko's tokenizer.

(Gecko's implementation simply cuts the comment rule out of the
tokenizer, so the slash and star are tokenized as DELIMs, which then
trigger a syntax error in the value parser.  I can't presently think
of a test case that would detect whether other UAs do exactly the same
thing.)

§4.4.2, 4.4.3 (double- and single-quote string state): EOF in these
contexts is *not* a parse error in CSS 2.1.  I have never liked CSS
2.1's "end of file closes all open constructs without error" rule, so
I have no objection to EOF being made into a parse error inside a
string, as long as we are confident there is no web-compat issue; I
would be happier still if that rule were scrapped in its *entirety*,
though it appears from §5 that it isn't.  In any case, the tokenizer
should then emit a bad-string instead of a string when it encounters
EOF in these states.

(In general I would prefer that section 4 not speak of "parse errors";
the tokenizer should be conceptualized as not having any error
conditions at all.  All possible streams of input characters produce a
well-defined stream of output tokens; it is the *parser* that always
treats certain output tokens (bad-string and bad-url) as parse
errors.)

§4.4.4 (hash state): css3-selectors requires that ID selectors be
limited to the subset of 'hash' tokens in which a valid CSS identifier
follows the '#'.  Gecko implements this by making an internal
distinction between 'hash' and 'id' tokens.  This might be worth
adding to the official syntax -- perhaps only as a 'this hash token
meets this requirement' flag bit.
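
A sketch of how that flag bit might be computed (Python; simplified
identifier shape, ignoring escapes and non-ASCII characters):

```python
import re

# Rough shape of a CSS identifier (escapes and non-ASCII omitted):
# an optional leading '-', a name-start character, then name characters.
CSS_IDENT = re.compile(r"-?[A-Za-z_][A-Za-z0-9_-]*\Z")

def hash_token_is_valid_id(value):
    """Hypothetical flag bit: does the text after '#' form a valid
    identifier, making the token usable as an ID selector?"""
    return CSS_IDENT.match(value) is not None
```

For example, '#123abc' is a perfectly good hash token as a hex color,
but would carry a false flag and so be rejected as an ID selector.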

§4.4.9 (ident state): If Gecko encounters the unusual sequence of
characters "- \ EOF" or "- \ newline" (spaces for clarity only), the
initial - is emitted as an ident token rather than a delim token.
This is a bug according to both this spec and CSS 2.1, but can anyone
think of a test case that would distinguish this behavior from the
'correct' behavior?

§4.4.10, 4.4.11 (transform function whitespace): see comments on §4.2 above.

§4.4.12 (number state): This state appears to contain unnecessary
sanity checks, since the data state will not transition *into* the
number state without first checking that the initial sequence of
characters forms a valid number.

§4.4.13, §4.4.14 (number-rest, number-fraction): The numeric value of
a percentage token should be "the number produced by interpreting the
number token's representation as a base-10 number" *divided by 100*.
(It might be better to state this in css3-values "Percentage" section
instead, but that's already in CR and doesn't say anything of the sort
right now.)
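
That is, under the proposed wording (sketch only):

```python
def percentage_numeric_value(representation):
    """Numeric value of a percentage token, per the proposed wording:
    interpret the number token's representation as a base-10 number,
    then divide by 100."""
    return float(representation.rstrip("%")) / 100
```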

§4.4.16 (sci-notation state): BUG: needs to allow for dimension and
percentage tokens to contain scientific notation.  SVG 1.1 does in
fact require this (section 4.2 -- see how <length> and <percentage>
are defined in terms of <number>).
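
For illustration, a simplified regex for the token shapes that should
be accepted (escapes and the full ident production omitted):

```python
import re

# A number, optionally in scientific notation, followed by a unit
# identifier or '%'.  Unit here is bare ASCII letters for brevity.
NUMBER = r"[+-]?(?:\d+|\d*\.\d+)(?:[eE][+-]?\d+)?"
DIMENSION_OR_PERCENTAGE = re.compile("(" + NUMBER + r")([A-Za-z]+|%)\Z")
```

As drafted, inputs like '5e-1cm' and '1e2%' do not tokenize as a
single dimension or percentage token, which contradicts SVG 1.1's
definitions of <length> and <percentage> in terms of <number>.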

§4.4.17 through 4.4.21 (the URL states): As with strings, EOF in any
of these states is not a parse error -- and produces a url token, not
a bad-url token -- in CSS 2.1.  Again, I have no objection to the
change as long as we are confident web compat is not affected.

§4.4.22 (bad URL state): It does simplify things to make this not
balance brackets and braces, but has anyone working on a non-Webkit
UA expressed an opinion on this change?

§4.7 What exactly does it mean to say "the comma token has been
added"?  It was a DELIM in 2.1 but so were :;{}[]().

§5 Two high-level points.

First, white space's significance in selectors is a design error which
should not be allowed to creep into other contexts.  It would be good,
therefore, to exclude whitespace tokens from appearing in any list
other than the prelude of a qualified rule.  This would also simplify
many of the algorithms in this section, which would no longer need to
mention that they discard whitespace.

Second, why can't the generic grammar parse all at-rules as

at-rule : AT-KEYWORD component-value* ( ';' | {}-block )

and leave the rest to specific parsers for each at-rule?  That would
eliminate the need for the "recognized at-rule list" and the
distinction between rule-filled and declaration-filled at-rules.  If a
trailing {}-block is too generic, we could instead define

generic-list : declaration ( ';' generic-list )?
             | qualified-rule generic-list?
             | at-rule generic-list?
             | /*empty*/

at-rule: AT-KEYWORD component-value* ( ';' | '{' generic-list '}' )

but I think I'd prefer keeping the generic at-rule grammar as
open-ended as possible.

Also, I have not carefully checked over the error recovery behavior to
make sure it is still consistent with CSS 2.1's explicit "rules for
recovering from parse errors."  It might be a good idea to add back a
similar explicit description of error recovery, if only so that future
modules don't screw it up.

§5.2 (quirks mode): The quirks defined in this spec should be thought
of as augmenting the set of acceptable values for particular property
declarations; they do not affect the generic grammar's determination
of where one declaration ends and another begins.  Therefore, they
belong with the definitions of <length> and <color>, or else with the
definitions of the specific declarations to which they apply; they
should not appear in this spec at all.

§5.4, issue 2: Gecko has several entry points other than the ones
listed, but they are all for specific constructs: media list, color,
selector, keyframe rule, keyframe selector, @supports declaration,
@supports condition.  Therefore it seems inappropriate to me to
mention them in this spec.  (Maybe there should be something about
other CSS modules adding other entry points.)

(For the curious looking at the code, the difference between
nsCSSParser::ParseStyleAttribute and ::ParseDeclarations is only the
argument list, not the grammar, AFAICT.)

§5.4.6 (comma-separated list of component values): "Append _temp_ to
_val_" is ambiguous.  Suppose _val_ is empty and _temp_ is [a b c].
Does appending _temp_ to _val_ set _val_ to [a b c] or to [ [a b c] ]?
I believe that the latter is necessary for correct processing of
e.g. the 'background' shorthand, but I can't tell whether this is
intended to be used for that (since this algorithm is not actually
used in this spec).
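
The ambiguity, in Python terms (illustration only):

```python
# _val_ empty, _temp_ = the component values of one comma-separated item.
val, temp = [], ["a", "b", "c"]

# Reading 1: temp becomes a single element of val.  I believe this is
# what e.g. the 'background' shorthand needs.
as_nested = val + [temp]    # [['a', 'b', 'c']]

# Reading 2: temp's elements are spliced onto the end of val.
as_spliced = val + temp     # ['a', 'b', 'c']
```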

§5.4.7 (an+b notation) I think this, too, belongs in the specific
grammar where it is used, not here.  (I note that it is unreferenced
by any other part of this specification.)

Also, above some very low complexity threshold, I find the verbosity
and lack of explicit structure in HTML5-style prose algorithms to be a
serious obstacle to understanding.  In the rest of the spec this is
not a problem, but I would have to rewrite this section in traditional
pseudocode in order to figure out whether it was correct.
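
For example, here is roughly what I mean by traditional pseudocode for
an+b, written from memory as a sketch and *not* checked against the
draft's algorithm (whitespace handling in particular is approximate):

```python
import re

ANB = re.compile(r"""\s* (?:
      (?P<odd>  odd  )
    | (?P<even> even )
    | (?P<a> [+-]? \d* ) n \s* (?: (?P<sign> [+-] ) \s* (?P<b> \d+ ) )?
    | (?P<b_only> [+-]? \d+ )
) \s* \Z""", re.IGNORECASE | re.VERBOSE)

def parse_anb(text):
    """Parse the an+b notation into an (a, b) pair, or None on a
    syntax error."""
    m = ANB.match(text)
    if m is None:
        return None
    if m.group("odd"):
        return (2, 1)
    if m.group("even"):
        return (2, 0)
    if m.group("b_only") is not None:
        return (0, int(m.group("b_only")))
    a_text = m.group("a")
    a = {"": 1, "+": 1, "-": -1}.get(a_text)
    if a is None:
        a = int(a_text)
    b = int(m.group("b") or 0)
    if m.group("sign") == "-":
        b = -b
    return (a, b)
```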

§5.5.2, 5.5.3 (at-statement, at-rule): Per recommendation above, these
should be collapsed together.

§5.5.8, 5.5.9 (quirks): Per recommendation above, these should be
removed.

§5.5.11 (functions): Per recommendation above, language about quirks
and 'rect' should be removed.

§6 (Serialization): If my suggestion of not explicitly including
whitespace tokens in lists other than qualified rule preludes is
taken, then this should be changed to insert whitespace rather than an
empty comment if the list being serialized is not a QR-prelude list.

zw

Received on Sunday, 17 February 2013 18:09:17 UTC