Re: [css3-syntax] Thoughts on proposed Syntax module from Tab Atkins Jr. on 2012-08-29 (www-style@w3.org from August 2012)

From: Tab Atkins Jr. <jackalmage@gmail.com>
Date: Tue, 28 Aug 2012 22:32:44 -0700
To: "L. David Baron" <dbaron@dbaron.org>
Cc: www-style@w3.org
Message-ID: <CAAWBYDD8ojVf+pEr2vuV7NCy=fCQpbYkJWLtmoP8L+Zd+5xArA@mail.gmail.com>
On Tue, Aug 28, 2012 at 2:28 PM, L. David Baron <dbaron@dbaron.org> wrote:
> Some thoughts on the css3-syntax draft at
> http://dev.w3.org/csswg/css3-syntax/ follow, both on the general
> approach and on the lists of changes.  I haven't read the state
> machine in detail.
>
>
> As I said at the face-to-face meeting, I think the approach that the
> specification takes to CSS's (), [], and {} matching rules is going
> in the wrong direction.  I think the normative specification text
> for these should be the general statements about how the processing
> works, and not the code-like form of the current specification,
> since I really *don't* want bugs in the specification that break the
> general rules to end up being codified in the specification.  I
> worry that ending up with exceptions to these rules could prevent us
> from making general improvements to parsing technology that would
> otherwise (without exceptions) be possible.  (For example, we might
> at some point in the future have generated parsers based on two
> different but parallel state machines, one describing the correct
> syntax and another describing the error handling behavior (for when
> the first state machine goes into a failure state) -- done in a way
> that a state in the error handling state machine can be determined
> at parser generation time from the state in the correct-input state
> machine.)
>
> That said, I think the problems with this approach don't show up
> much in the material currently specified; I think most of the
> problems appear when describing how to parse the syntax of all the
> property values, which is where the bulk of CSS parsing logic lives.
> It's not clear to me whether
> http://dev.w3.org/csswg/css3-syntax/#declaration-value-mode0 is the
> extent of your plans for specifying how to parse CSS values or
> whether you're planning to actually specify value parsing in a
> similar way to the rest of the specification.

No, I was not planning to.  The major problems I have with the CSS
grammar relate to the understandability of large grammars, and of
grammars that need to be total.  Neither of these is true of
properties and at-rules - both usually have small, easily understood
and implemented grammars, and neither needs to be total - you apply
them against an already-delimited set of tokens to see if they match
(and if so, break it up into named pieces), so it's okay to simply not
match.
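
A minimal sketch of what "okay to simply not match" can look like, in
Python.  The (type, value) token tuples, the toy grammar (a DIMENSION
followed by an IDENT), and the names `match_toy_grammar`, "width" and
"style" are all hypothetical, chosen only for illustration; they are not
taken from the draft.

```python
from typing import Dict, List, Optional, Tuple

Token = Tuple[str, str]  # hypothetical (type, value) pair


def match_toy_grammar(tokens: List[Token]) -> Optional[Dict[str, str]]:
    """Match an already-delimited token list against `DIMENSION IDENT`.

    Returns the named pieces on success and None otherwise; the grammar
    does not need to say anything about inputs it does not match.
    """
    if len(tokens) != 2:
        return None
    (kind1, value1), (kind2, value2) = tokens
    if kind1 == "DIMENSION" and kind2 == "IDENT":
        return {"width": value1, "style": value2}
    return None
```

The caller has already found the end of the declaration, so a stray
token simply makes the match fail and the declaration is dropped.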


> I also think this sort of specification describing a state machine
> in prose is generally far less readable than a specification that
> describes a tokenization and grammar in a concise format.  I think
> the special case of HTML parsing (which has so many complex rules
> that it can't reasonably be written in a concise format) doesn't
> mean that all other languages should be described in the same prose
> style.  Yes, CSS 2.1's description of parsing is not as precise as
> it should be, but I'm not at all convinced that the fix to that
> problem needs to be as drastic as switching to a state machine
> written in prose.

I agree that it's much less readable for someone looking for an
overview of the syntax.  I have an introductory section explaining the
grammar in much simpler terms, and plan to expand and improve it.

However, I disagree that it's less readable as a specification when
you want details.  Reading complex regexes is simply hard.  In the
course of just my couple of years in this group, I've seen several
discussions where people disagreed about what a particular regex in
the spec meant, or how best to write a regex that captures a certain
concept that everyone understands.  I find it *much* easier to read
and write a bit of state-machine for this kind of thing.  Maybe I'm
just weird.  ^_^

The parser, in particular, is not very long at all (CSS is pretty
simple at the parser level), and I think the error-handling is *much*
simpler to understand and interoperably implement like this than like
Chapter 4's hand-wavey rules.  (As evidenced by the fact that everyone
implements error-handling somewhat differently today.)

Tokenization is a somewhat different beast - it's pretty long, just
because we have a decent number of tokens.  Alternatively, we could go
with JSON-style train diagrams for defining tokenization.  They strike
a good balance - easier to read and understand than regexes, and more
compact than state machines.  I wouldn't be opposed to this at all -
it would just take me a bit to write the necessary SVG.

>
> Some specific comments on "3.5 Changes from the CSS 2.1 Tokenizer":
> ==================================================================
>
>   # 1. The DASHMATCH and INCLUDES tokens have been removed. They can
>   # instead be handled simply by having them parse as DELIM tokens.
>   # It was weird to privilege just those two types of attribute
>   # equality operators, when Selectors 3 adds several more.
>
> I think this is a mistake.  In Gecko we treat these, and all the new
> selectors introduced in css3-selectors, as tokens.  In particular,
> not treating DASHMATCH as a separate token type makes the rules for
> parsing namespaces in attribute selectors extremely complicated;
> with DASHMATCH as a separate token it's trivial to implement
> correctly.  I'd strongly prefer to leave these as distinct tokens
> and make new ones for the new selectors.

This seems strange to me.  An attr name is just an IDENT - with
namespaces, it's just an IDENT DELIM(|) IDENT.  Disambiguating this from
a |= selector requires only a single token of lookahead when doing
Selector parsing.
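
A rough sketch of that lookahead, in Python.  It assumes hypothetical
(type, value) token tuples and a hypothetical `parse_attr_name` helper,
and it only covers the `name` and `ns|name` forms (not `*|name` or
`|name`), so treat it as an illustration rather than a full Selectors
parser.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # hypothetical (type, value) pair


def parse_attr_name(tokens: List[Token], i: int = 0) -> Tuple[Optional[str], str, int]:
    """Parse `name` or `ns|name` starting at tokens[i].

    Returns (namespace, name, index_of_next_token).
    """
    kind, value = tokens[i]
    if kind != "IDENT":
        raise ValueError("expected IDENT at start of attribute name")
    bar = tokens[i + 1] if i + 1 < len(tokens) else None
    after_bar = tokens[i + 2] if i + 2 < len(tokens) else None
    # After a DELIM(|), peek one token: an IDENT means the bar is a
    # namespace separator; anything else (e.g. DELIM(=)) means the bar
    # belongs to a |= matcher and is left for the caller.
    if bar == ("DELIM", "|") and after_bar is not None and after_bar[0] == "IDENT":
        return value, after_bar[1], i + 3
    return None, value, i + 1
```

For `[svg|href]` this yields the namespace "svg" and name "href"; for
`[lang|=en]` it yields just "lang" and stops before the bar.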

However, I'm fine with parsing them specially and adding the new
Selectors 4 tokens.  I was trying, as much as possible, to just
implement the Chapter 4 grammar, so I didn't feel comfortable adding
new ones, but I also severely disliked the asymmetry of having only
some of the attr relations expressed as tokens.  Having them all as
tokens (with the knowledge that we may expand the list in the future
if we add more) is fine with me.

>   # 2. The BAD-URI token (now bad-url) is "self-contained". In other
>   # words, once the tokenizer realizes it's in a bad-url rather than
>   # a url token, it just seeks forward to look for the closing ),
>   # ignoring everything else. This behavior is simpler than treating
>   # it like a FUNCTION token and paying attention to opened blocks
>   # and such. Only WebKit exhibits this behavior, but it doesn't
>   # appear that we've gotten any compat bugs from it.
>
> So if I'm understanding this correctly, this is more than the change
> we already made for issue 129 that's described in
> https://bugzilla.mozilla.org/show_bug.cgi?id=569646 .  You're saying
> that not only do we ignore [] and {} that are prior to the point at
> which the URL is known to be invalid, but that you also ignore []
> and {} that are *after* that point, until you reach a closing )?
>
> I guess this change seems reasonable to me.

Correct.  It was slightly simpler to spec, seems irrelevant for
authors, and matches at least one browser's behavior.
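
As a sketch of the "self-contained" behavior described above, in
Python: once the tokenizer knows it is in a bad-url, it scans forward to
the next ) and pays no attention to [ ] { } along the way.  The function
name `consume_bad_url` is hypothetical, and this sketch deliberately
leaves out escape handling, which the discussion above does not cover.

```python
def consume_bad_url(text: str, pos: int) -> int:
    """Skip the rest of a bad-url token.

    `pos` is the index at which the URL was found to be invalid.
    Returns the index just past the closing ')', or len(text) if the
    input ends first.
    """
    while pos < len(text):
        if text[pos] == ")":
            return pos + 1
        # Brackets and braces are not tracked; they are ordinary
        # characters here.
        pos += 1
    return pos
```

For example, in `url(foo bar{[) rest` the URL becomes invalid at the
`b`, and tokenization resumes right after the `)` even though a `{` and
a `[` were opened in between.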

> Some specific comments on "3.7. Changes from CSS 2.1 Core Grammar":
> ==================================================================
>
>   # 1. No whitespace or comments are allowed between the DELIM(!)
>   # and IDENT(important) tokens when processing an !important
>   # directive at the end of a style rule.
>
> I disagree with this change; I think disallowing whitespace is a
> significant compatibility problem.  There are a significant number
> of uses with whitespace that people have written in Gecko's codebase
> (including the only use of !important in a code example in our
> userContent-example.css that explains how to write user style
> sheets).  Three of the examples in the cascading chapter of CSS 2.1
> also use whitespace.

Huh, okay, I didn't realize that.  I was operating under the
assumption of no compat problems, and hoping someone would correct me
if I was wrong, so thanks.  ^_^

I'm fine with allowing this if necessary, I just wanted to avoid it if
I could.  In conjunction with the change you want, disallowing ! at the
top-level of properties for any purpose other than !important, this
goes back to being easy - I can just throw away any whitespace or
comments after the !, and then mark the whole property as either
important or invalid based on the first token I see other than
whitespace or comment.
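
A minimal Python sketch of that rule.  It assumes a flat list of
top-level (type, value) tokens for one declaration value; the token
type names and the function `classify_important` are hypothetical.  It
only looks at the first significant token after the !, as described
above, and does not address what may follow it.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # hypothetical (type, value) pair

SKIPPABLE = {"WHITESPACE", "COMMENT"}


def classify_important(tokens: List[Token]) -> str:
    """Return "normal", "important", or "invalid" for a declaration value."""
    for i, tok in enumerate(tokens):
        if tok != ("DELIM", "!"):
            continue
        # Throw away whitespace and comments after the "!", then decide
        # based on the first other token.
        for kind, value in tokens[i + 1:]:
            if kind in SKIPPABLE:
                continue
            if kind == "IDENT" and value.lower() == "important":
                return "important"
            return "invalid"
        return "invalid"  # nothing but whitespace/comments after "!"
    return "normal"
```

With this, `red ! important` and `red !/**/important` both come out as
important, while `red !foo` is invalid.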

>   # 2. The handling of some miscellaneous ‘special’ tokens (like an
>   # unmatched } token) showing up in various places in the grammar
>   # has been specified with some reasonable behavior shown by at
>   # least one browser. Previously, stylesheets with those tokens in
>   # those places just didn't match the stylesheet grammar at all, so
>   # their handling was totally undefined.
>
> I'm hoping that you defined unmatched } ) and ] to behave like any
> other incorrect token at that spot would behave.  Is that the case?

Yes.  (They're just appended to the value like any other token, and
then get rejected by the grammar of whatever they appear in.)

> What other cases are there?

I'd have to load the parser back into my head to see.  Figuring out
exactly what cases are undefined in the 2.1 grammar is hard. :/

>   # 3. Quirks mode parsing differences are now officially
>   # recognized in the parser.
>
> I think these quirks should be described in terms of the value
> grammar of the properties rather than a token postprocessing step.
> While the behavior isn't distinguishable in current implementations,
> variables make it distinguishable.  I believe implementations
> implement it as a change to the grammar of the properties; at the
> very least, Gecko does.  Describing it the way you do would require
> implementations to completely reimplement these quirks when they
> implement variables (so that they reject quirky values that were
> inserted by variable substitution).  (Or do other implementations
> actually implement it as a token processing step?)

You mean, just provide an alternative grammar for those properties
that includes <number>s in addition to <length>s, or <number>s,
<ident>s, and <delim>s in addition to <color>s?
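
To make the "alternative grammar" idea concrete for the length case,
here is a small Python sketch.  The function `match_length`, the
`quirks_mode` flag, and the (type, value) token tuples are hypothetical;
the sketch assumes the quirk in question is the unitless-length one,
where a bare number is read as pixels.

```python
from typing import Optional, Tuple

Token = Tuple[str, str]  # hypothetical (type, value) pair


def match_length(token: Token, quirks_mode: bool = False) -> Optional[str]:
    """Match a single token against <length>, or <length> | <number> in quirks mode.

    Returns the length as a string, or None if the token does not match.
    """
    kind, value = token
    if kind == "DIMENSION":
        return value
    if kind == "NUMBER":
        if float(value) == 0:
            return value  # a unitless zero is a valid length in any mode
        if quirks_mode:
            return value + "px"  # quirk: bare number is read as pixels
    return None
```

Because the quirk lives in the property's value grammar here, it applies
to whatever tokens reach the property, however they got there.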

That's possible, sure, but a little bit more difficult I think.
Doable if you feel strongly about it.  (I think, for the color thing,
I'd just define a new grammar term like

I'm not sure how we implement the quirks.  I can find out.

~TJ
Received on Wednesday, 29 August 2012 05:33:35 UTC