Re: [css3-syntax] Critique of Feb 15 draft

Thank you so much for this detailed critique!  Sorry for the delay in
responding - I was trying to deal with all of it before sending the
response.  ^_^

On Sun, Feb 17, 2013 at 10:08 AM, Zack Weinberg <zackw@panix.com> wrote:
> §3.2: As a security precaution, UAs should skip steps 3 and 4 of the
> algorithm to determine the fallback encoding when the referring
> stylesheet or document is not _same-origin_ with the stylesheet (refer
> to RFC 6454 for the definition of _same-origin_).  See
> <https://www.w3.org/Bugs/Public/show_bug.cgi?id=20886>.

So you support something like what the issue says, forcing the host
language to define an "environment charset"?  Or should we just handle
that here in CSS?

> §3.2.1: It is unclear to me why CR and CR LF are normalized to LF at
> this stage but FF isn't.  It might actually be simpler and clearer to
> define a notion of "one vertical whitespace sequence" and refer to
> that only in the small number of tokenizer states that can consume
> vertical whitespace.

In a string, a normal newline is invalid, but you can escape a newline
to include it in the string.  I think it makes the spec a little
simpler to collapse \r\n into a single newline char, so I don't have
to push those details into the string states.  That's all.  I could
simplify the preprocessing to only convert \r\n specifically into a
\n, as it's not actually necessary to do anything with a lone \r.
Thoughts?

Other than that, you don't want to collapse runs of whitespace into a
single character.  Multiple newline chars in a row each have to be
escaped separately in a string, and everywhere else a run of
whitespace is ignored as a unit anyway, so collapsing wouldn't buy
anything.

> Similarly, it might be simpler and clearer to
> map U+0000 to U+FFFD only in the small number of tokenizer states that
> can *consume* U+0000.

As Simon says, a lot of states can potentially consume this character.
If we don't want U+0000 to show up at all, it's less error-prone to
handle it once, early, in preprocessing, than to special-case it in
every state that might see it.
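(For concreteness, here's a little Python sketch of the preprocessing
as the draft currently describes it - illustrative only, not the
spec's wording: CR and CR LF become LF, U+0000 becomes U+FFFD, and FF
is deliberately left alone, per your observation above.)

```python
def preprocess(text):
    """Illustrative sketch of the draft's input preprocessing:
    normalize CR LF and lone CR to LF, and replace U+0000 with
    U+FFFD once, up front, so no tokenizer state has to consider
    them.  FF is intentionally not touched at this stage."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.replace("\u0000", "\ufffd")
```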

> The last two paragraphs in this section conceptually belong to the
> "Tokenization" section.

You're right.  Fixed.

> §4: The algorithm in this section might benefit from conversion to a
> recursive-descent presentation as was already done for the parser (§5).

I'm not certain.  Switching to a recursive-descent parser was great
for the Parsing section because those kinds of parsers are good for
nested structures, which is most of CSS.  Tokens only *barely* have a
concept of nesting - things like both types of strings having nearly
the same internals, or url tokens containing strings.

It would save some duplication, but I could probably also do that just
by defining some more functions.

> It might also benefit from a revision which consistently treated the
> "current input character" as _not_ already consumed; only consume
> characters when you have committed to including them in the current
> token.  I just made an analogous change throughout Gecko's tokenizer
> (https://bugzilla.mozilla.org/show_bug.cgi?id=543151 ) and I think
> readability is improved quite a bit.

Given the number of places that reconsume the current input character,
compared to the number that don't, I think making this change in the
spec would worsen readability.  However, it's perfectly reasonable to
*think of* it the way you describe, with the current input character
simply being the first character of the remaining input stream.

Your issue in the cited bug, where you had a pushback buffer, doesn't
apply here, since reconsuming never puts more than a single character
back.
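(Here's a tiny Python sketch of what I mean - names are made up, and
it's nothing like the spec's actual machinery: the "pushback" is just
a one-bit flag, because only the current character ever gets
re-examined.)

```python
class Tokenizer:
    """Hypothetical sketch of the spec's reconsume model: the current
    input character is consumed by default, and a state can push back
    at most one character by setting a flag - no buffer needed."""

    def __init__(self, text):
        self.text = text
        self.pos = -1           # index of the current input character
        self.reconsume = False  # at most one character of "pushback"

    def consume(self):
        if self.reconsume:
            self.reconsume = False
        else:
            self.pos += 1
        return self.text[self.pos] if self.pos < len(self.text) else None

    def reconsume_current(self):
        # The next consume() re-returns the current character.
        self.reconsume = True
```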

> If we are adjusting the tokenizer for the convenience of the parser
> anyway, it might make sense to merge CDO and CDC, which are always
> treated the same.

As Simon said, we need to preserve these as separate tokens.  CSS 2.1
allowed them in most contexts, though no rule or declaration ever
accepted them in its grammar.  More importantly, Variables wants to
preserve them (see dbaron's arguments in the most recent f2f minutes,
where we agreed to change the syntax of custom properties to
explicitly allow CDO and CDC tokens, since they aren't included in the
top level of a "value" production).

> Since we are sanctioning css3-selectors' under-the-table change to the
> generic syntax, perhaps we should go further, and generalize the class
> of MATCH tokens so that any punctuation character other than :;,(){}[]
> followed by an equals sign is a single token?  That would allow
> further addition of selector match operators without having to come
> back and change the tokenizer again.  (The major reason I can think of
> *not* to do this is that it might cause trouble for future extensions
> to calc().  But C does fine with a bunch of x= operators, so I doubt
> it will be a problem in practice.)

I'd be fine with this, if other people think it's reasonable.  I'd
like dbaron's opinion, since he's the one who wanted me to add the
extra *-match tokens in the first place.
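(To make sure we're picturing the same rule, here's a rough Python
sketch of the generalization you're proposing - hypothetical token
shapes, and obviously a real tokenizer has many more states: any
punctuation character outside the structural set :;,(){}[] that's
immediately followed by "=" becomes a single match token.)

```python
# Structural punctuation that should never fuse with a following "=".
STRUCTURAL = set(":;,(){}[]")

def next_token(text, i):
    """Hypothetical helper: return (token, new_index) for text[i],
    fusing <punct>= into one "match" token per the proposed rule."""
    ch = text[i]
    if (not ch.isalnum() and not ch.isspace()
            and ch not in STRUCTURAL
            and i + 1 < len(text) and text[i + 1] == "="):
        return (("match", ch + "="), i + 2)
    return (("delim", ch), i + 1)
```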

> In CSS 2.1 percentage tokens are unacceptable in any context where an
> <integer> is required, even if their numeric value is integral
> (e.g. '200%' == 2).  If this is not intended to change, then they
> should not be described as having an integer flag.

<integer> is a type of <number>, which is not a <percentage>, so yes,
that's true forever.

I suppose this is reasonable - I can't think of any reason why a
percentage or a dimension would ever be restricted to just integers.
Fixed.

> §4.2: This wart can and should be dealt with by the grammar rule for
> the `transform` attribute's value.  It does not need to be in the
> generic tokenizer.

I don't know about this.  It would be relatively annoying to try to
express it on that end.

> A more significant SVG wart, which *does* require a tokenizer flag,
> is: in Gecko, IE, and Opera (but not Webkit), CSS comments are
> disabled when parsing an SVG mapped attribute.  For instance:
>
>     <circle cx="10" cy="10" r="5"
>       fill="#E0B0FF /* international danger mauve */"/>
>     <circle cx="20" cy="10" r="5"
>       style="fill:#E0B0FF /* international danger mauve */"/>
>
> Webkit draws both circles in mauve.  Gecko, IE, and Opera draw only
> the second one in mauve.  (Complete test case at
> <http://www.owlfolio.org/scratchpad/danger-mauve.svg>.)  It is not
> clear to me whether this behavior is required by the SVG spec (I only
> looked at SVG 1.1), and honestly I would be happy to take the mode bit
> out of Gecko's tokenizer.
>
> (Gecko's implementation simply cuts the comment rule out of the
> tokenizer, so the slash and star are tokenized as DELIMs, which then
> trigger a syntax error in the value parser.  I can't presently think
> of a test case that would detect whether other UAs do exactly the same
> thing.)

The SVGWG is considering switching over to having presentation
attributes parse as CSS, so I'd rather not do anything here
until/unless they decide for certain not to do that, and we establish
that there's no other way to deal with it.

> §4.4.2, 4.4.3 (double- and single-quote string state): EOF in these
> contexts is *not* a parse error in CSS 2.1.  I have never liked the
> "end of file closes all open constructs without error" rule in CSS
> 2.1, and have no objection to EOF being made into a parse error inside
> a string, as long as we are confident there is no web-compat issue,
> particularly if we're scrapping the "end of file closes all open
> constructs without error" rule in its *entirety*.  (It appears from §5
> that we aren't.) However, then the tokenizer should emit a bad-string
> instead of a string when it encounters EOF in these states.

I've fixed it so that EOF is not a parse error in strings or urls;
instead it just finishes the token.

> (In general I would prefer that section 4 not speak of "parse errors";
> the tokenizer should be conceptualized as not having any error
> conditions at all.  All possible streams of input characters produce a
> well-defined stream of output tokens; it is the *parser* that always
> treats certain output tokens (bad-string and bad-url) as parse
> errors.)

I've finally linked up "parse error" to the definition, like I'd
always intended.  Parse errors don't really mean anything - they're
just there for conformance checkers to help flag errors.  The actual
error-handling is fully specified by the spec itself.

> §4.4.4 (hash state): css3-selectors requires that ID selectors be
> limited to the subset of 'hash' tokens in which a valid CSS identifier
> follows the '#'.  Gecko implements this by making an internal
> distinction between 'hash' and 'id' tokens.  This might be worth
> adding to the official syntax -- perhaps only as a 'this hash token
> meets this requirement' flag bit.

Sounds useful.  Added.  The hash-rest state now sets a type flag on
the hash token before emitting it to either "id" or "unrestricted".
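(Roughly like this, in sketch form - simplified Python, ignoring
escapes and non-ASCII characters, which the real identifier check has
to handle: the flag is "id" when a valid CSS identifier could start
right after the "#".)

```python
import re

# Simplified "would an identifier start here?" check: an optional
# hyphen followed by a name-start character.  The real rule also
# covers escapes and code points above U+0080.
IDENT_START = re.compile(r"-?[A-Za-z_]")

def hash_token(after_hash):
    """Hypothetical constructor for a hash token with its type flag."""
    type_flag = "id" if IDENT_START.match(after_hash) else "unrestricted"
    return ("hash", after_hash, type_flag)
```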

> §4.4.9 (ident state): If Gecko encounters the unusual sequence of
> characters "- \ EOF" or "- \ newline" (spaces for clarity only), the
> initial - is emitted as an ident token rather than a delim token.
> This is a bug according to both this spec and CSS 2.1, but can anyone
> think of a test case that would distinguish this behavior from the
> 'correct' behavior?

As Simon says, given that the DELIM(\) token is emitted afterwards and
would invalidate anything it shows up in, it's indistinguishable in
practice.  (The only thing it wouldn't invalidate is a custom
property, but you can only observe its value by putting it into
another property, which it would invalidate, or serializing, which
prevents you from telling an IDENT and DELIM apart.)

> §4.4.10, 4.4.11 (transform function whitespace): see comments on §4.2 above.

See response on 4.2 above.

> §4.4.12 (number state): This state appears to contain unnecessary
> sanity checks, since the data state will not transition *into* the
> number state without first checking that the initial sequence of
> characters forms a valid number.

You're right.  Removing all the sanity checks made this step much cleaner.

> §4.4.13, §4.4.14 (number-rest, number-fraction): The numeric value of
> a percentage token should be "the number produced by interpreting the
> number token's representation as a base-10 number" *divided by 100*.
> (It might be better to state this in css3-values "Percentage" section
> instead, but that's already in CR and doesn't say anything of the sort
> right now.)

Simon discussed this adequately - I'd prefer to keep it as it is.

> §4.4.16 (sci-notation state): BUG: needs to allow for dimension and
> percentage tokens to contain scientific notation.  SVG 1.1 does in
> fact require this (section 4.2 -- see how <length> and <percentage>
> are defined in terms of <number>).

Ooh, good to know.  I'll have to ping the WG about this since it's
going against a previous resolution, but I don't have a problem with
this.

> §4.4.17 through 4.4.21 (the URL states): As with strings, EOF in any
> of these states is not a parse error -- and produces a url token, not
> a bad-url token -- in CSS 2.1.  Again, I have no objection to the
> change as long as we are confident web compat is not affected.

Fixed by making it emit a url token.

> §4.4.22 (bad URL state): It does simplify things to make this not
> balance brackets and braces, but has anyone working on a non-Webkit
> UA expressed an opinion on this change?

Not that I know of. :/

> §4.7 What exactly does it mean to say "the comma token has been
> added"?  It was a DELIM in 2.1 but so were :;{}[]().

Simon answered this.

> §5 Two high-level points.
>
> First, white space's significance in selectors is a design error which
> should not be allowed to creep into other contexts.  It would be good,
> therefore, to exclude whitespace tokens from appearing in any list
> other than the prelude of a qualified rule.  This would also simplify
> many of the algorithms in this section, which would no longer need to
> mention that they discard whitespace.

Simon and dbaron answered this.

> Second, why can't the generic grammar parse all at-rules as
>
> at-rule : AT-KEYWORD component-value* ( ';' | {}-block )
>
> and leave the rest to specific parsers for each at-rule?  That would
> eliminate the need for the "recognized at-rule list" and the
> distinction between rule-filed and declaration-filled at-rules.  If a
> trailing {}-block is too generic, we could instead define
>
> generic-list : declaration ( ';' generic-list )?
>              | qualified-rule generic-list?
>              | at-rule generic-list?
>              | /*empty*/
>
> at-rule: AT-KEYWORD component-value* ( ';' | '{' generic-list '}' )
>
> but I think I'd prefer keeping the generic at-rule grammar as
> open-ended as possible.

Hm, after some thought, I think you're probably right.  For one, this
means that parsing won't change when an at-rule switches from
"unknown" to "known".  For two, it lets us potentially do even crazier
things in the contents of at-rules, like directly embed SVG. ^_^
Done.
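(In consume-an-at-rule terms, the new shape is something like the
following Python sketch - hypothetical token tuples, and it punts on
nested blocks, which the real algorithm handles by recursing through
component values.)

```python
# at-rule : AT-KEYWORD component-value* ( ';' | {}-block )

def consume_at_rule(tokens, i):
    """tokens[i] is an ("at-keyword", name) token; returns (rule, new_i).
    EOF closes the rule, per the usual recovery behavior."""
    name = tokens[i][1]
    i += 1
    prelude = []
    while i < len(tokens):
        ttype = tokens[i][0]
        if ttype == "semicolon":
            return ({"name": name, "prelude": prelude, "block": None}, i + 1)
        if ttype == "{":
            block, i = consume_block(tokens, i + 1)
            return ({"name": name, "prelude": prelude, "block": block}, i)
        prelude.append(tokens[i])
        i += 1
    return ({"name": name, "prelude": prelude, "block": None}, i)

def consume_block(tokens, i):
    # Simplification: collects tokens up to the matching "}" without
    # handling nested blocks.
    body = []
    while i < len(tokens) and tokens[i][0] != "}":
        body.append(tokens[i])
        i += 1
    return body, i + 1
```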

> Also, I have not carefully checked over the error recovery behavior
> make sure it is still consistent with CSS 2.1's explicit "rules for
> recovering from parse errors."  It might be a good idea to add back a
> similar explicit description of error recovery, if only so that future
> modules don't screw it up.

Done.  It's an informative section, but it goes through all the cases.

> §5.2 (quirks mode): The quirks defined in this spec should be thought
> of as augmenting the set of acceptable values for particular property
> declarations; they do not affect the generic grammar's determination
> of where one declaration ends and another begins.  Therefore, they
> belong with the definitions of <length> and <color>, or else with the
> definitions of the specific declarations to which they apply; they
> should not appear in this spec at all.

Okay, okay, fine.  I'll discuss with fantasai about where to move them to.

> §5.4, issue 2: Gecko has several entry points other than the ones
> listed, but they are all for specific constructs: media list, color,
> selector, keyframe rule, keyframe selector, @supports declaration,
> @supports condition.  Therefore it seems inappropriate to me to
> mention them in this spec.  (Maybe there should be something about
> other CSS modules adding other entry points.)

Yeah, good point.  Added a note.

> §5.4.6 (comma-separated list of component values): "Append _temp_ to
> _val_" is ambiguous.  Suppose _val_ is empty and _temp_ is [a b c].
> Does appending _temp_ to _val_ set _val_ to [a b c] or to [ [a b c] ]?
> I believe that the latter is necessary for correct processing of
> e.g. the 'background' shorthand, but I can't tell whether this is
> intended to be used for that (since this algorithm is not actually
> used in this spec).

It's the latter - I didn't say to append the *contents* of temp to val.

Plus, if you did the former, it would kinda defeat the purpose of
having temp at all - it would just be the same as consuming a list of
component values, except with the commas thrown away.
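(That is, the result is a list of lists, one sub-list per
comma-separated group - which is exactly what e.g. the 'background'
shorthand needs for layer splitting.  A toy Python sketch, with tokens
as bare strings:)

```python
def parse_comma_separated(tokens):
    """Sketch of the "append temp to val" reading: temp itself is
    appended as a sub-list, not its contents."""
    val, temp = [], []
    for tok in tokens:
        if tok == ",":
            val.append(temp)   # append temp itself, then start fresh
            temp = []
        else:
            temp.append(tok)
    val.append(temp)
    return val
```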

> §5.4.7 (an+b notation) I think this, too, belongs in the specific
> grammar where it is used, not here.  (I note that it is unreferenced
> by any other part of this specification.)

I disagree - I consider this essentially similar to a new token, just
one that's ambiguous with a bunch of existing tokens.  The reason it's
unreferenced by anything else in this spec is because it's impossible
to tell when to apply it until you apply language-specific knowledge.

> Also, above some very low complexity threshold, I find the verbosity
> and lack of explicit structure in HTML5-style prose algorithms to be a
> serious obstacle to understanding.  In the rest of the spec this is
> not a problem, but I would have to rewrite this section in traditional
> pseudocode in order to figure out whether it was correct.

What do you recommend doing?  As Simon argued, I believe handling this
purely as a token-based algorithm is too complex, because there are
*way* more cases than there appear to be at first.  I think this is
roughly as simple as I can make it without actually defining a new
tokenizer for the contents.
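(For what it's worth, here's roughly how small a
traditional-pseudocode treatment gets if you cheat and lean on a
regular expression over the raw text - a sketch only, which ignores
escapes, comments, and the token-level ambiguities like "2n+1"
tokenizing as a dimension followed by a number, which is exactly where
the prose algorithm's extra cases come from.)

```python
import re

# an+b microsyntax over raw text: "odd", "even", a bare integer,
# or [a]n optionally followed by a signed b.
ANB = re.compile(r"""^\s*(?:
    (odd)|(even)|
    ([+-]?\d+)(?!n)|                    # integer only: b
    ([+-]?\d*)n\s*(?:([+-])\s*(\d+))?   # an, optionally +/- b
)\s*$""", re.IGNORECASE | re.VERBOSE)

def parse_anb(text):
    """Return (a, b) for an an+b expression, or None on failure."""
    m = ANB.match(text)
    if not m:
        return None
    if m.group(1):
        return (2, 1)                   # odd
    if m.group(2):
        return (2, 0)                   # even
    if m.group(3):
        return (0, int(m.group(3)))     # bare integer
    a_txt = m.group(4)
    a = 1 if a_txt in ("", "+") else -1 if a_txt == "-" else int(a_txt)
    b = int(m.group(6)) * (1 if m.group(5) == "+" else -1) if m.group(5) else 0
    return (a, b)
```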

> §5.5.2, 5.5.3 (at-statement, at-rule): Per recommendation above, these
> should be collapsed together.
>
> §5.5.8, 5.5.9 (quirks): Per recommendation above, these should be
> removed.
>
> §5.5.11 (functions): Per recommendation above, language about quirks
> and 'rect' should be removed.
>
> §6 (Serialization): If my suggestion of not explicitly including
> whitespace tokens in lists other than qualified rule preludes is
> taken, then this should be changed to insert whitespace rather than an
> empty comment if the list being serialized is not a QR-prelude list.

As dbaron argued, this isn't a good idea.

~TJ

Received on Sunday, 24 February 2013 04:33:01 UTC