
Re: [css3-syntax] Critique of Feb 15 draft

From: Zack Weinberg <zackw@panix.com>
Date: Mon, 18 Feb 2013 23:02:33 -0500
Message-ID: <5122F959.6020708@panix.com>
To: Simon Sapin <simon.sapin@kozea.fr>
CC: www-style@w3.org
On 2013-02-17 8:47 PM, Simon Sapin wrote:
>
>> §3.2.1: It is unclear to me why CR and CR LF are normalized to LF at
>> this stage but FF isn't.  It might actually be simpler and clearer to
>> define a notion of "one vertical whitespace sequence" and refer to
>> that only in the small number of tokenizer states that can consume
>> vertical whitespace.
>
> AFAICT it doesn’t make a difference, since newlines are not allowed in
> quoted strings or in backslash-escapes. But I don’t see how vertical
> whitespace is a better term than newline.

I'm not sure what you're getting at here.  This is a problem of internal 
consistency.  In section 4.3, "newline" is defined as either U+000A LINE 
FEED or U+000C FORM FEED, with a note that U+000D CARRIAGE RETURN is 
dealt with during preprocessing.  I am suggesting that either FORM FEED 
should also be mapped to LINE FEED in the preprocessing phase, or that 
that part of the preprocessing be eliminated and all four possible 
"newline" sequences be listed as such in 4.3.

The thing about "consuming vertical whitespace" is me having Gecko's 
implementation a bit too much on the brain.  You can ignore it.
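For concreteness, here's roughly what I mean by doing all the newline
normalization in one place -- a sketch in Python, not anyone's actual
implementation:

```python
def preprocess(text):
    # Normalize every newline sequence -- CR LF, lone CR, and FF -- to a
    # single LF, so the tokenizer's notion of "newline" is just U+000A.
    return (text.replace("\r\n", "\n")
                .replace("\r", "\n")
                .replace("\f", "\n"))
```

The alternative is to drop this pass entirely and list all four sequences
under "newline" in 4.3; either way the tokenizer proper only has to know
about one spelling.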

>> Similarly, it might be simpler and clearer to
>> map U+0000 to U+FFFD only in the small number of tokenizer states that
>> can *consume* U+0000.
>
> Not sure it would be easier: as non-ASCII, U+FFFD is a "name character"
> allowed in ident and ident-like tokens.

I think all you have to do is say that U+0000 also counts as a "name 
character" but whenever it would be "added to a token", U+FFFD is added 
instead.

The point of all this is to say that maybe we don't need this 
preprocessing phase at all.
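To illustrate the U+0000 suggestion -- a sketch with made-up helper names,
not spec text:

```python
def is_name_char(c):
    # Treat U+0000 as a name character, alongside the usual set
    # (ASCII letters, digits, _, -, and all non-ASCII).
    return c.isalnum() or c in "_-\x00" or ord(c) > 0x7F

def append_to_token(token_chars, c):
    # The U+FFFD substitution happens only when a character is actually
    # added to a token, so no whole-input preprocessing pass is needed.
    token_chars.append("\uFFFD" if c == "\x00" else c)
```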

> Agreed. In particular, url tokens could reuse the definition of quoted
> strings, at-keyword and dimension that of ident, and hash and ident
> could both be based on what CSS 2.1 calls a "name".

I could try to write it up that way if y'all like.

> My rust implementation has both "consume a character" and "current
> character" functions, but occasionally still needs to "go back" one
> character. Maybe "going back" could be avoided entirely with more
> look-ahead, but I’m not sure the result is simpler.

Please do have a look at how nsCSSScanner.cpp is now: I did eliminate 
all "going back" (except that needed for 'an+b' notation, controlled by 
the parser) in favor of multi-character look-ahead.  This was definitely 
worth it in a real implementation because it meant I could completely 
remove the pushback buffer.  It may or may not be worth it in a 
specification.
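The shape of it, reduced to a toy -- peek(n) over buffered input instead of
a pushback buffer (names mine, nothing like the real nsCSSScanner
interface):

```python
class Scanner:
    # Tokenizer input with multi-character look-ahead and no pushback.
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def peek(self, n=0):
        # Inspect the character n places ahead without consuming it;
        # None means end of input.
        i = self.pos + n
        return self.text[i] if i < len(self.text) else None

    def advance(self, n=1):
        # Commit to n characters.  All decisions are made via peek()
        # beforehand, so nothing ever has to be pushed back.
        self.pos += n
```

Every "go back" in the old code became a peek() at decision time.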

>
>> CSS 2.1 lumped colon, semicolon, comma, [, ], (, ), {, and } tokens in
>> with delim tokens.  I presume that they have been pulled out for ease
>> of reference in the parser?  I don't think it will affect how
>> implementations do things, but it should still be mentioned in §4.7 as
>> a change from the previous spec.
>
> Do you mean Appendix G? I don’t find it very relevant anyway. It’s a
> super-set of what is valid in 2.1, but if you include some Level 3
> modules it’s neither a super-set nor a subset.
>
> Section 4.1.1 (the normative one) does have tokens for these, except comma.

... I have somehow managed not to notice that block of productions until 
now, despite having close-read 4.1.1 dozens of times.  *headdesk*

> We recently made them "preserved" in selectors, preludes and declaration
> values. We had a proposal for a --> combinator in selectors, I suspect
> not knowing that it is a specific token.

That may actually be a web-breaking change.  I'd want to do some 
experiments before giving it the thumbs-up.

>> Since we are sanctioning css3-selectors' under-the-table change to the
>> generic syntax, perhaps we should go further, and generalize the class
>> of MATCH tokens...
...
> Could work, but this proposal needs an inclusive list of characters. By
> staying within ASCII, omitting :;,(){}[]"'\ to disambiguate with other
> CSS constructs and &< for ease of inclusion in an HTML <style> element,
> I came up with this:
>
> !#%+-./=>?@_`

I'd take out = _ - > ` from this list, but otherwise I like it.

_ and - should be excluded to avoid people getting confused about what 
[foo_=bar] or [foo-=bar] means (the computer will interpret the 
punctuation as part of the leading identifier, but humans may expect _= 
and -= to be treated as match operators regardless of spacing).  = 
should be excluded because we don't want the C headache with = and == 
meaning two different things.  And I think > and ` should be excluded as 
well, because they "belong with" < and ' respectively.
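To see the [foo_=bar] hazard concretely -- a rough approximation of the
ident production (no escapes, no full non-ASCII handling), just to show
where the machine ends the identifier:

```python
import re

# Rough approximation of a CSS identifier, ignoring escapes; enough
# to make the point about trailing _ and -.
IDENT = re.compile(r"-?[A-Za-z_][A-Za-z0-9_-]*")

def split_attr_body(body):
    # Split the inside of [...] into (identifier, rest): a trailing
    # _ or - sticks to the identifier, not to the operator.
    m = IDENT.match(body)
    return body[:m.end()], body[m.end():]
```

split_attr_body("foo_=bar") gives ("foo_", "=bar"), not ("foo", "_=bar") --
which is exactly the confusion I want to rule out.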


>> In CSS 2.1 percentage tokens are unacceptable in any context where an
>> <integer> is required, even if their numeric value is integral
>> (e.g. '200%' == 2).  If this is not intended to change, then they
>> should not be described as having an integer flag.
>
> Percentage and dimension tokens are not <integer> values, but they
> indeed have a "type" flag that is not used in any current CSS property.
> But I guess it does not hurt. Implementations are free not to have it,
> since the difference is never visible.

In Gecko, dimensions have their "valid as an integer" flag set if the 
number part was lexically an integer.  I don't know offhand if anything 
looks at that; I can check tomorrow.  Percentages never have that flag set.

Regardless, I think that it is clearer if the standard does not describe 
percentages as having an integer flag.

>> §4.2: This wart can and should be dealt with by the grammar rule for
>> the `transform` attribute's value.  It does not need to be in the
>> generic tokenizer.
>
> Agreed that this would make more sense, but I don’t care much either way.

I do care because I don't want to have any more special cases in the 
tokenizer.  (I'm already cranky about url().)

>> §4.4.2, 4.4.3 (double- and single-quote string state): EOF in these
>> contexts is *not* a parse error in CSS 2.1.
>
> The tokenizer in section 4.1.1 of CSS 2.1 makes these bad-string tokens,
> which the Core Grammar in the same section does not allow in
> property values. This is by the way inconsistent with the "Unexpected
> end of style sheet" rule of section 4.2 of CSS 2.1.

That is an error in 4.1.1 (my error, in fact, I invented the 
bad-whatever tokens so that you could implement the CSS 2.1 generic 
grammar without any backing up in Flex, and I missed that one case). 
The intent was always that the "Unexpected end of style sheet" rule in 
4.2 would apply.

>> (In general I would prefer that section 4 not speak of "parse errors";
>> the tokenizer should be conceptualized as not having any error
>> conditions at all.
>
> This is already the case. Note that "parse errors" are not fatal, but
> only used in validators or for logging in developer tools. They are
> independent of whether a rule or declaration is ignored as invalid.

This is not clear from the text.  Adding the informative section about 
how error recovery works would help, but I think that the words "parse 
error" really should not appear anywhere in section 4.


> Note that bad-string and bad-url are now "preserved" in preludes and
> declaration values.

I think this is a mistake.  We do not want any future module to get a 
notion to start treating either of them as valid, in any context.

>> §4.4.4 (hash state): css3-selectors requires that ID selectors be
>> limited to the subset of 'hash' tokens in which a valid CSS identifier
>> follows the '#'.  Gecko implements this by making an internal
>> distinction between 'hash' and 'id' tokens.  This might be worth
>> adding to the official syntax -- perhaps only as a 'this hash token
>> meets this requirement' flag bit.
>
> Oh, this is bad. §6.5 of Selectors 3 mentions # followed by an ident,
> which is more restrictive than the grammar in §10.1 of the same spec.
> I’m not even sure the restriction is intentional.
>
> Test case: (%231 is #1 url-encoded.)
>
> data:text/html,<style>%231{}</style><script>document.write(document.styleSheets[0].cssRules[0].selectorText)</script>
>
>
> Gecko and WebKit show nothing (the selector is invalid.) Presto shows
> #\31, where \31 is the hex escape for 1.

We better find out what IE does before we go changing anything.  Does 
the selector *match* <span id="1"> in Presto?

>> §4.4.12 (number state): This state appears to contain unnecessary
>> sanity checks, since the data state will not transition *into* the
>> number state without first checking that the initial sequence of
>> characters forms a valid number.
>
> Agree that some of it is unnecessary. Just removing the "anything else"
> part would be weird if you don’t have all the context. I suggest "This
> case should never happen. This is a programming error."

Sounds good to me; agree it would be weird.

>> §4.4.13, §4.4.14 (number-rest, number-fraction): The numeric value of
>> a percentage token should be "the number produced by interpreting the
>> number token's representation as a base-10 number" *divided by 100*.
>> (It might be better to state this in css3-values "Percentage" section
>> instead, but that's already in CR and doesn't say anything of the sort
>> right now.)
>
> We can still make clarifications in CRs, but I think it’s pretty clear
> what a percentage means. In my mental model though, the division by 100
> only occurs when resolving a percentage relative to some other value.
> The percentage’s own "value" goes up to 100, not 1.

This may be me projecting internal Gecko behavior onto the spec.  I 
don't care all that much what the official "value" of a percentage token 
is, since it's going to be reserialized as x% regardless.  The nitpicky 
part of my brain thinks there should be a definition of "percentage" 
somewhere in CSS (probably in -values) but it isn't all that important, 
after all we *are* using the word the same way everyone else does.  (It 
might be worth clarifying that they are *not* clamped to the range [0%, 
100%] though.)
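For what it's worth, the two mental models only diverge in where the
division lives, not in what gets computed -- e.g. (a sketch, not spec
language):

```python
def resolve_percentage(pct_value, reference):
    # The token stores 200 for '200%'; the division by 100 happens at
    # resolution time against the reference value.  Note there is no
    # clamping to [0, 100] anywhere.
    return pct_value / 100.0 * reference
```

A model that stores 2.0 in the token and multiplies at resolution time
produces the same answers.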

>> §4.4.16 (sci-notation state): BUG: needs to allow for dimension and
>> percentage tokens to contain scientific notation.  SVG 1.1 does in
>> fact require this (section 4.2 -- see how <length> and <percentage>
>> are defined in terms of <number>).
>
> This was apparently deliberate:
>
> http://lists.w3.org/Archives/Public/www-style/2012Aug/0900.html
> http://lists.w3.org/Archives/Public/www-style/2013Jan/0303.html
>
> But maybe we need to change it for compat with SVG.

I think we definitely do need to change this.

>> §4.4.17 through 4.4.21 (the URL states): As with strings, EOF in any
>> of these states is not a parse error -- and produces a url token, not
>> a bad-url token -- in CSS 2.1.  Again, I have no objection to the
>> change as long as we are confident web compat is not affected.
>
> My reading of CSS 2.1 (both in chapter 4 and Appendix G) shows that it
> does produce a bad-url token.

That's the same problem with 4.1.1 not getting EOF quite right, as for 
strings.  The 4.2 "unexpected end of style sheet" rule trumps.  I'm 
pretty sure all UAs are consistent here.

>> Second, why can't the generic grammar parse all at-rules as
>>
>> at-rule : AT-KEYWORD component-value* ( ';' | {}-block )
>>
>> and leave the rest to specific parsers for each at-rule?  That would
>> eliminate the need for the "recognized at-rule list" and the
>> distinction between rule-filed and declaration-filled at-rules.
>
> I like this, and it actually became much more doable when we recently
> added an independent "Parse a list of declarations" entry point. We
> would also need "Parse a list of rules" but that’s easy.
>
> Each at-rule would need to define that their value/body is parsed by
> calling one of these (or even something else). But they already do
> something similar for the prelude anyway.

Yah.

>
>> If a
>> trailing {}-block is too generic, we could instead define
>>
>> generic-list : declaration ( ';' generic-list )?
>>               | qualified-rule generic-list?
>>               | at-rule generic-list?
>>               | /*empty*/
>>
>> at-rule: AT-KEYWORD component-value* ( ';' | '{' generic-list '}' )
>>
>> but I think I'd prefer keeping the generic at-rule grammar as
>> open-ended as possible.
>
> Mixing declarations and qualified rules is bad as you need unbounded
> look-ahead to disambiguate which you’re parsing.

That's adequately dealt with by each at-rule defining itself as 
containing either this or that.  I want to leave the door open for 
at-rules that contain none of the above, is the thing.

> I would be in favor of having non-normative prose describing error
> recovery, just like the railroad diagrams describe valid syntax.

That would be great.  We could probably lift it almost verbatim from 
CSS2.1 section 4.2.

>> §5.2 (quirks mode): The quirks defined in this spec should be thought
>> of as augmenting the set of acceptable values for particular property
>> declarations; they do not affect the generic grammar's determination
>> of where one declaration ends and another begins.  Therefore, they
>> belong with the definitions of <length> and <color>, or else with the
>> definitions of the specific declarations to which they apply; they
>> should not appear in this spec at all.
>
> Same as the "transform function whitespace" flag. This would make more
> sense but I personally don’t care much.

I think this is important because having it in the wrong place (i.e. 
here) may give people the mistaken impression that the generic parsing 
differs in quirks mode.  You have to read this document very carefully 
to realize that it doesn't.

>> §5.4, issue 2: Gecko has several entry points other than the ones
>> listed, but they are all for specific constructs: media list, color,
>> selector, keyframe rule, keyframe selector, @supports declaration,
>> @supports condition.  Therefore it seems inappropriate to me to
>> mention them in this spec.  (Maybe there should be something about
>> other CSS modules adding other entry points.)
>
> My understanding is that all of these belong in other modules. They
> would themselves use one or more of Syntax’s entry points and work on
> rules, declarations, and component values; not tokens.

Agree.

>> §5.4.6 (comma-separated list of component values): "Append _temp_ to
>> _val_" is ambiguous.  Suppose _val_ is empty and _temp_ is [a b c].
>> Does appending _temp_ to _val_ set _val_ to [a b c] or to [ [a b c] ]?
>> I believe that the latter is necessary for correct processing of
>> e.g. the 'background' shorthand, but I can't tell whether this is
>> intended to be used for that (since this algorithm is not actually
>> used in this spec).
>
> The latter. It’s "append", not "set".

Ah, no, you misunderstand.  Lemme try again.  Given the comma-separated list

     a b c, d e f, g h i

is the result of this algorithm supposed to be [ a b c d e f g h i ] or 
[ [ a b c ] [ d e f ] [ g h i ] ]?  "Append" could mean either.
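In code, the reading I believe 'background' needs -- list-of-lists, with
each run appended as a unit (sketch, ignoring nested blocks):

```python
def parse_comma_list(component_values):
    # Split a flat list of component values on commas into a list of
    # lists.  Each completed run is appended as a *sublist*; its items
    # are not spliced into the outer list.
    result, current = [], []
    for cv in component_values:
        if cv == ",":
            result.append(current)
            current = []
        else:
            current.append(cv)
    result.append(current)
    return result
```

which turns [a b c , d e f , g h i] into [ [a b c] [d e f] [g h i] ].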

>> Also, above some very low complexity threshold, I find the verbosity
>> and lack of explicit structure in HTML5-style prose algorithms to be a
>> serious obstacle to understanding.  In the rest of the spec this is
>> not a problem, but I would have to rewrite this section in traditional
>> pseudocode in order to figure out whether it was correct.
>
> I don’t know any other style that allows specifying all of the error
> handling without being much more complex. For example, a "total" grammar
> for CSS (so that there is no non-matching input) would be unreadable.
> The railroad diagrams are meant to help with the high-level
> understanding. I could write one up for an+b.

It's specifically *this section* (an+b) that has gone over the "too 
complicated for a prose algorithm" threshold (which I should also 
emphasize is a thing about my brain; I'm not claiming that everyone has 
this problem).

A railroad diagram might help.

Gecko's implementation *seems* to be much simpler than what is written here:

* On entry, we have just consumed the leading (.
* Read the next non-whitespace token.
* If it is a DIMENSION or IDENT whose (unit) text begins with "n-" or 
"-n-", push all of that text except the first one or two characters, 
respectively, back to the scanner.
* Interpret "123n" as "123 n", "n" as "1 n", and "-n" as "-1 n".
* Proceed, tokenizing normally.

Of course, we can get away with that because our tokenizer and parser 
operate in lockstep; but I rather strongly suspect a simpler algorithm 
is possible even in the parsing model css3-syntax adopts.
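This is not Gecko's re-lexing trick, but for comparison, here is how small
the an+b grammar is when you can match the whole serialized text at once
(a throwaway regex sketch, with odd/even folded in; escapes and interior
comments ignored):

```python
import re

# The whole an+b microsyntax as one pattern; parse_an_b returns
# (a, b) on a match and None otherwise.
AN_B = re.compile(r"""^\s*(?:
    (?P<odd>odd) | (?P<even>even) |
    (?P<a>[+-]?\d*) n \s* (?: (?P<sign>[+-]) \s* (?P<b1>\d+) )? |
    (?P<b0>[+-]?\d+)
)\s*$""", re.X | re.I)

def parse_an_b(text):
    m = AN_B.match(text)
    if not m:
        return None
    if m.group("odd"):
        return (2, 1)
    if m.group("even"):
        return (2, 0)
    if m.group("b0") is not None:
        return (0, int(m.group("b0")))
    a = m.group("a")
    a = 1 if a in ("", "+") else (-1 if a == "-" else int(a))
    b = int(m.group("b1")) if m.group("sign") else 0
    if m.group("sign") == "-":
        b = -b
    return (a, b)
```

That obviously doesn't slot into a streaming tokenizer, but it does
suggest the prose algorithm is carrying a lot of incidental complexity.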

> I don’t like serialization being context-sensitive. (Although this is
> not a strong opinion.)

Well, the whitespace proposal is a non-starter anyway as dbaron pointed 
out.  And having thought about it some more, I agree context sensitivity 
is bad.

zw
Received on Tuesday, 19 February 2013 04:03:03 GMT
