
Re: [css3-syntax] Critique of Feb 15 draft

From: Simon Sapin <simon.sapin@kozea.fr>
Date: Tue, 19 Feb 2013 08:42:45 +0100
Message-ID: <51232CF5.8070303@kozea.fr>
To: Zack Weinberg <zackw@panix.com>
CC: www-style@w3.org
On 19/02/2013 05:02, Zack Weinberg wrote:
> On 2013-02-17 8:47 PM, Simon Sapin wrote:
>>> §3.2.1: It is unclear to me why CR and CR LF are normalized to LF at
>>> this stage but FF isn't.  It might actually be simpler and clearer to
>>> define a notion of "one vertical whitespace sequence" and refer to
>>> that only in the small number of tokenizer states that can consume
>>> vertical whitespace.
>> AFAICT it doesn’t make a difference, since newlines are not allowed in
>> quoted strings or in backslash-escapes. But I don’t see how vertical
>> whitespace is a better term than newline.
> I'm not sure what you're getting at here.  This is a problem of internal
> consistency.  In section 4.3, "newline" is defined as either U+000A LINE
> FEED or U+000C FORM FEED, with a note that U+000D CARRIAGE RETURN is
> dealt with during preprocessing.  I am suggesting that either FORM FEED
> should also be mapped to LINE FEED in the preprocessing phase, or that
> that part of the preprocessing be eliminated and all four possible
> "newline" sequences be listed as such in 4.3.
[Skipping the U+0000 part, same response.]
> The point of all this is to say that maybe we don't need this
> preprocessing phase at all.

My point is that this is purely editorial: web authors wouldn’t be able 
to tell the difference. But yeah, internal spec consistency and 
eliminating the preprocessing phase sound good. Without preprocessing 
though, handling \r\n becomes a bit more painful. (It should be a single 
newline.)

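For what it’s worth, the preprocessing stage is tiny either way. A rough 
Python sketch (function name mine), with FF folded in as you suggest and 
the U+0000 replacement included:

```python
def preprocess(text):
    # Sketch of the preprocessing stage, extended per the suggestion
    # above: CR, CR LF and FF all become a single LF, and U+0000 is
    # replaced by U+FFFD REPLACEMENT CHARACTER.
    return (text.replace("\r\n", "\n")
                .replace("\r", "\n")
                .replace("\f", "\n")
                .replace("\x00", "\ufffd"))
```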
> Please do have a look at how nsCSSScanner.cpp is now: I did eliminate
> all "going back" (except that needed for 'an+b' notation, controlled by
> the parser) in favor of multi-character look-ahead.  This was definitely
> worth it in a real implementation because it meant I could completely
> remove the pushback buffer.  It may or may not be worth it in a
> specification.

I don’t have a strong opinion on this.

>> We recently made them "preserved" in selectors, preludes and declaration
>> values. We had a proposal for a --> combinator in selectors, I suspect
>> not knowing that it is a specific token.
> That may actually be a web-breaking change.  I'd want to do some
> experiments before giving it the thumbs-up.

Even though CDO and CDC are now "preserved", any higher-level syntax 
(Selectors, MQs, any property value, …) still rejects them.

The only exception is variables; there was some discussion of allowing 
them there.

Do you have an example of how this could be web-breaking?

>>> Since we are sanctioning css3-selectors' under-the-table change to the
>>> generic syntax, perhaps we should go further, and generalize the class
>>> of MATCH tokens...
> ...
>> Could work, but this proposal needs an inclusive list of characters. By
>> staying within ASCII, omitting :;,(){}[]"'\ to disambiguate with other
>> CSS constructs and &< for ease of inclusion in an HTML <style> element,
>> I came up with this:
>> !#%+-./=>?@_`
> I'd take out = _ - > ` from this list, but otherwise I like it.
> _ and - should be excluded to avoid people getting confused about what
> [foo_=bar] or [foo-=bar] means (the computer will interpret the
> punctuation as part of the leading identifier, but humans may expect _=
> and -= to be treated as match operators regardless of spacing).  =
> should be excluded because we don't want the C headache with = and ==
> meaning two different things.  And I think > and ` should be excluded as
> well, because they "belong with" < and ' respectively.

Sounds good. So the list would be !#%+./?@ (in addition to the existing 
match operators).

% has the same issue with percentage tokens as -_ with idents, but that 
should be fine since attribute selectors require an ident before the 
match operator.

> Regardless, I think that it is clearer if the standard does not describe
> percentages as having an integer flag.

I’m fine with this if we’re sure we will never want an 
<integer-percentage> type. (By the way, 42% has the integer flag under 
the current definition; it’s not only multiples of 100%.) But the same 
applies to dimension tokens.
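To illustrate what I mean about 42% (sketch only; has_integer_flag is a 
made-up name, and a real implementation would look at the token, not 
re-examine the source string):

```python
import re

def has_integer_flag(text):
    # Under the current definition, a percentage token carries the
    # integer flag iff its numeric part was written as an integer --
    # so "42%" has it, while "42.0%" does not.
    assert text.endswith("%")
    return re.fullmatch(r"[+-]?\d+", text[:-1]) is not None
```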

>>> §4.2: This wart can and should be dealt with by the grammar rule for
>>> the `transform` attribute's value.  It does not need to be in the
>>> generic tokenizer.
>> Agreed that this would make more sense, but I don’t care much either way.
> I do care because I don't want to have any more special cases in the
> tokenizer.  (I'm already cranky about url().)

I’m starting to agree.

>>> §4.4.2, 4.4.3 (double- and single-quote string state): EOF in these
>>> contexts is *not* a parse error in CSS 2.1.
> The intent was always that the "Unexpected end of style sheet" rule in
> 4.2 would apply.

I’m fine with changing that. (Same with EOF in url tokens, not quoted 
below.)

>> This is already the case. Note that "parse errors" are not fatal, but
>> only used in validators or for logging in developer tools. They are
>> independent of whether a rule or declaration is ignored as invalid.
> This is not clear from the text.  Adding the informative section about
> how error recovery works would help, but I think that the words "parse
> error" really should not appear anywhere in section 4.

It’s in §3: "Certain points in the parsing algorithm are said to be 
parse errors. […]"
I’m not super happy with the wording either. Do you have any suggestions 
for improving it? Also, I suppose the term "parse error" could be renamed.

>> Note that bad-string and bad-url are now "preserved" in preludes and
>> declaration values.
> I think this is a mistake.  We do not want any future module to get a
> notion to start treating either of them as valid, in any context.

We did this to allow Media Queries error handling, where a syntax error 
makes only one of the comma-separated MQs invalid, not the whole list. 
For example, these still need to apply:

@media ], all {}
@media url("/foo" invalid), all {}

The only error handling that the Syntax module can do (if we assume that 
"preludes" are opaque) is drop the whole rule. This is fine for 
Selectors but not MQs.
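In other words, the best a comma-aware consumer can do is split the 
prelude on top-level commas and validate each part separately. A toy 
Python sketch (single characters standing in for component values; 
split_on_top_level_commas is a made-up name):

```python
def split_on_top_level_commas(tokens):
    # Split a flat token sequence on commas that are not nested inside
    # (), [] or {}, so that each comma-separated media query can be
    # validated (and, on error, dropped) independently.
    parts, current, depth = [], [], 0
    for tok in tokens:
        if tok == "," and depth == 0:
            parts.append(current)
            current = []
            continue
        if tok in "([{":
            depth += 1
        elif tok in ")]}":
            depth = max(0, depth - 1)
        current.append(tok)
    parts.append(current)
    return parts
```

With @media ], all the first part is invalid on its own, but "all" 
survives and the rule still applies.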

>> Oh, this is bad. §6.5 of Selectors 3 mentions # followed by an ident,
>> which is more restrictive than the grammar in §10.1 of the same spec.
>> I’m not even sure the restriction is intentional.
>> Test case: (%231 is #1 url-encoded.)
>> data:text/html,<style>%231{}</style><script>document.write(document.styleSheets[0].cssRules[0].selectorText)</script>
>> Gecko and WebKit show nothing (the selector is invalid.) Presto shows
>> #\31, where \31 is the hex escape for 1.
> We better find out what IE does before we go changing anything.
> Does the selector *match* <span id="1"> in Presto?

Yes. Green in Opera and IE:
data:text/html,<body id=1><style>%231{background:green

I’ll start a separate thread on [selectors4].

> The nitpicky
> part of my brain thinks there should be a definition of "percentage"
> somewhere in CSS (probably in -values) but it isn't all that important,
> after all we *are* using the word the same way everyone else does.  (It
> might be worth clarifying that they are *not* clamped to the range [0%,
> 100%] though.)

Feel free to start a [css3-values] thread about this. Shouldn’t be a 
problem to add, even in CR.

> Ah, no, you misunderstand.  Lemme try again.  Given the comma-separated list
>       a b c, d e f, g h i
> is the result of this algorithm supposed to be [ a b c d e f g h i ] or
> [ [ a b c ] [ d e f ] [ g h i ] ]?  "Append" could mean either.

Ah, I didn’t know that "append" was ambiguous in English. I just assumed 
the meaning of Python’s list.append method (as opposed to list.extend).

Would it help if we defined temp as a list of component values, and val 
as a list of lists of component values?
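For the record, the Python distinction I had in mind:

```python
temp = ["a", "b", "c"]

val = []
val.append(temp)   # nests: val is now [["a", "b", "c"]]

val2 = []
val2.extend(temp)  # flattens: val2 is now ["a", "b", "c"]
```

So for a b c, d e f, g h i the intended result is the nested 
[ [ a b c ] [ d e f ] [ g h i ] ].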

> It's specifically *this section* (an+b) that has gone over the "too
> complicated for a prose algorithm" threshold (which I should also
> emphasize is a thing about my brain; I'm not claiming that everyone has
> this problem).
> A railroad diagram might help.
> Gecko's implementation *seems* to be much simpler than what is written here:
> * On entry, we have just consumed the leading (.
> * Read the next non-whitespace token.
> * If it is a DIMENSION or IDENT whose (unit) text begins with "n-" or
> "-n-", push all of that text except the first one or two characters,
> respectively, back to the scanner.
> * Interpret "123n" as "123 n", "n" as "1 n", and "-n" as "-1 n".
> * Proceed, tokenizing normally.
> Of course, we can get away with that because our tokenizer and parser
> operate in lockstep; but I rather strongly suspect a simpler algorithm
> is possible even in the parsing model css3-syntax adopts.

Tab and I went back and forth between two approaches to an+b:

1. "Serialize" tokens/component values back to a string, and reparse 
that string. (What the ED currently has.)

2. Parse tokens as the rest of the parser does.

The second approach seems simple at first, but gets tricky really fast. 
Take these inputs for example:

   2n-1
   2n -1
   2n - 1

They are all valid, but the minus sign can be part of the unit of a 
dimension token, the sign of a number token, or a delim token.

If you can still come up with something simpler, that would be great.
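For comparison, approach 1 reduces to something like this rough sketch 
once the component values have been serialized back to a string (a 
deliberately simplified regex, not the spec’s algorithm; parse_an_plus_b 
is a made-up name):

```python
import re

# One regex over the serialized <an+b> text; returns (a, b) or None.
AN_PLUS_B = re.compile(
    r"""^\s*
        (?:
            (?P<a>[+-]?\d*)n\s*(?:(?P<sign>[+-])\s*(?P<b>\d+))?
          | (?P<b_only>[+-]?\d+)
          | (?P<keyword>odd|even)
        )\s*$""",
    re.VERBOSE | re.IGNORECASE,
)

def parse_an_plus_b(text):
    m = AN_PLUS_B.match(text)
    if m is None:
        return None
    if m.group("keyword"):
        return (2, 1) if m.group("keyword").lower() == "odd" else (2, 0)
    if m.group("b_only") is not None:
        return (0, int(m.group("b_only")))
    a_text = m.group("a")
    # "n" means 1n, "-n" means -1n.
    a = int(a_text) if a_text not in ("", "+", "-") else (-1 if a_text == "-" else 1)
    b = int(m.group("sign") + m.group("b")) if m.group("b") else 0
    return (a, b)
```

Note that all three spellings above come out as (2, -1), regardless of 
which token the minus sign originally belonged to, which is exactly why 
the serialize-and-reparse approach sidesteps the tokenization trickiness.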

Simon Sapin
Received on Tuesday, 19 February 2013 07:43:14 UTC
