[css3-syntax] Thoughts on proposed Syntax module from L. David Baron on 2012-08-28 (www-style@w3.org from August 2012)

From: L. David Baron <dbaron@dbaron.org>
Date: Tue, 28 Aug 2012 14:28:18 -0700
To: www-style@w3.org
Message-ID: <20120828212818.GA13291@crum.dbaron.org>

Some thoughts on the css3-syntax draft at
http://dev.w3.org/csswg/css3-syntax/ follow, both on the general
approach and on the lists of changes.  I haven't read the state
machine in detail.


As I said at the face-to-face meeting, I think the approach that the
specification takes to CSS's (), [], and {} matching rules is going
in the wrong direction.  I think the normative specification text
for these should be the general statements about how the processing
works, and not the code-like form of the current specification,
since I really *don't* want bugs in the specification that break the
general rules to end up being codified in the specification.  I
worry that ending up with exceptions to these rules could prevent us
from making general improvements to parsing technology that would
otherwise (without exceptions) be possible.  (For example, we might
at some point in the future have generated parsers based on two
different but parallel state machines, one describing the correct
syntax and another describing the error handling behavior (for when
the first state machine goes into a failure state) -- done in a way
that a state in the error handling state machine can be determined
at parser generation time from the state in the correct-input state
machine.)

That said, I think the problems with this approach don't show up
much in the material currently specified; I think most of the
problems appear when describing how to parse the syntax of all the
property values, which is where the bulk of CSS parsing logic lives.
It's not clear to me whether
http://dev.w3.org/csswg/css3-syntax/#declaration-value-mode0 is the
extent of your plans for specifying how to parse CSS values or
whether you're planning to actually specify value parsing in a
similar way to the rest of the specification.


I also think this sort of specification describing a state machine
in prose is generally far less readable than a specification that
describes a tokenization and grammar in a concise format.  I think
the special case of HTML parsing (which has so many complex rules
that it can't reasonably be written in a concise format) doesn't
mean that all other languages should be described in the same prose
style.  Yes, CSS 2.1's description of parsing is not as precise as
it should be, but I'm not at all convinced that the fix to that
problem needs to be as drastic as switching to a state machine
written in prose.


Some specific comments on "3.5 Changes from the CSS 2.1 Tokenizer":
==================================================================

  # 1. The DASHMATCH and INCLUDES tokens have been removed. They can
  # instead be handled simply by having them parse as DELIM tokens.
  # It was weird to privilege just those two types of attribute
  # equality operators, when Selectors 3 adds several more.

I think this is a mistake.  In Gecko we treat these, and all the new
attribute selector operators introduced in css3-selectors, as tokens.
In particular, not treating DASHMATCH as a separate token type makes
the rules for parsing namespaces in attribute selectors extremely
complicated; with DASHMATCH as a separate token it's trivial to
implement correctly.  I'd strongly prefer to leave these as distinct
tokens and make new ones for the new operators.
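
To illustrate the DASHMATCH point, here is a minimal sketch in Python
of a toy tokenizer for the inside of an attribute selector.  It is not
Gecko's tokenizer or the draft's; the IDENT pattern is deliberately
simplified, and the names tokenize, _WITH_MATCH, and _DELIM_ONLY are
hypothetical.  It only shows that with a DASHMATCH token, the "|" of a
namespace prefix and the "|=" operator are already distinct by the time
the parser sees them.

```python
import re

# Simplified token patterns (assumption: real CSS identifiers also allow
# escapes and non-ASCII characters, which are omitted here).
_COMMON = r"""
    (?P<IDENT>-?[A-Za-z_][A-Za-z0-9_-]*)
  | (?P<S>\s+)
  | (?P<DELIM>.)
"""

# Variant 1: "|=" and "~=" are their own tokens, as in CSS 2.1.
# They must be tried before DELIM so that "|=" is not split.
_WITH_MATCH = re.compile(
    r"(?P<DASHMATCH>\|=) | (?P<INCLUDES>~=) |" + _COMMON, re.VERBOSE
)

# Variant 2: every operator character is a separate DELIM token.
_DELIM_ONLY = re.compile(_COMMON, re.VERBOSE)


def tokenize(text, match_tokens=True):
    """Return (kind, text) pairs, dropping whitespace tokens."""
    pattern = _WITH_MATCH if match_tokens else _DELIM_ONLY
    return [
        (m.lastgroup, m.group())
        for m in pattern.finditer(text)
        if m.lastgroup != "S"
    ]


# With DASHMATCH, the parser can tell the two uses of "|" apart by kind.
print(tokenize("ns|attr|=val"))
# With DELIM only, "attr|=val" and "ns|attr" both start IDENT, DELIM("|"),
# so the parser needs extra lookahead to decide which one it has.
print(tokenize("attr|=val", match_tokens=False))
```

In the DELIM-only variant the parser also has to check that no
whitespace separated the "|" from the "=", which this toy version does
not track.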

  # 2. The BAD-URI token (now bad-url) is "self-contained". In other
  # words, once the tokenizer realizes it's in a bad-url rather than
  # a url token, it just seeks forward to look for the closing ),
  # ignoring everything else. This behavior is simpler than treating
  # it like a FUNCTION token and paying attention to opened blocks
  # and such. Only WebKit exhibits this behavior, but it doesn't
  # appear that we've gotten any compat bugs from it. 

So if I'm understanding this correctly, this is more than the change
we already made for issue 129 that's described in
https://bugzilla.mozilla.org/show_bug.cgi?id=569646 .  You're saying
that not only do we ignore [] and {} that are prior to the point at
which the URL is known to be invalid, but that you also ignore []
and {} that are *after* that point, until you reach a closing )?

I guess this change seems reasonable to me.
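
As I read the quoted text, the "self-contained" behavior amounts to
something like the following Python sketch.  The function name
skip_bad_url is hypothetical, and the sketch assumes that escapes get
no special treatment, which the quoted text does not say either way.

```python
def skip_bad_url(text, pos):
    """Return the index just past the ')' that ends a bad-url.

    `pos` is the point at which the tokenizer has realized the url is
    invalid.  Everything up to the next ')' is skipped, including any
    '[', ']', '{' or '}', with no block matching at all.  If there is
    no ')', the rest of the input is consumed.
    """
    end = text.find(")", pos)
    return len(text) if end == -1 else end + 1


css = "url(a b{[) x"
# Scanning resumes right after the first ')', even though '{' and '['
# were opened inside the bad url and never closed.
rest = css[skip_bad_url(css, 4):]
```

A FUNCTION-style treatment would instead have to track the "{" and
"[" as opened blocks before it could accept the ")".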


Some specific comments on "3.7. Changes from CSS 2.1 Core Grammar":
==================================================================

  # 1. No whitespace or comments are allowed between the DELIM(!)
  # and IDENT(important) tokens when processing an !important
  # directive at the end of a style rule. 

I disagree with this change; I think disallowing whitespace is a
significant compatibility problem.  Gecko's codebase contains many
uses of !important written with whitespace (including the only use
of !important in a code example in our userContent-example.css,
which explains how to write user style sheets).  Three of the
examples in the cascading chapter of CSS 2.1 also use whitespace.
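
The difference between the two rules can be sketched in Python as
below.  This is an illustration on raw text rather than on tokens, and
the names STRICT, LENIENT, and has_important are hypothetical.  LENIENT
follows the CSS 2.1 behavior of allowing whitespace and comments
between the "!" and "important"; STRICT follows the proposed change.

```python
import re

# Proposed rule: "!" must be immediately followed by "important".
STRICT = re.compile(r"!important\s*$", re.IGNORECASE)

# CSS 2.1-style rule: whitespace and /* comments */ may appear between
# the "!" and "important".
LENIENT = re.compile(
    r"!(?:\s|/\*.*?\*/)*important\s*$", re.IGNORECASE | re.DOTALL
)


def has_important(value, allow_gap):
    """Report whether a declaration value ends in an !important flag."""
    pattern = LENIENT if allow_gap else STRICT
    return pattern.search(value) is not None


# Accepted under the CSS 2.1-style rule, rejected under the proposed one.
print(has_important("red ! important", allow_gap=True))
print(has_important("red ! important", allow_gap=False))
```

Any existing style sheet that writes "! important" would silently lose
the flag, or the whole declaration, under the strict rule.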

  # 2. The handling of some miscellaneous ‘special’ tokens (like an
  # unmatched } token) showing up in various places in the grammar
  # has been specified with some reasonable behavior shown by at
  # least one browser. Previously, stylesheets with those tokens in
  # those places just didn't match the stylesheet grammar at all, so
  # their handling was totally undefined. 

I'm hoping that you defined unmatched }, ), and ] to behave like any
other incorrect token at that spot would behave.  Is that the case?

What other cases are there?

  # 3. Quirks mode parsing differences are now officially
  # recognized in the parser. 

I think these quirks should be described in terms of the value
grammar of the properties rather than a token postprocessing step.
While the behavior isn't distinguishable in current implementations,
variables make it distinguishable.  I believe implementations
implement it as a change to the grammar of the properties; at the
very least, Gecko does.  Describing it the way you do would require
implementations to completely reimplement these quirks when they
implement variables (so that they reject quirky values that were
inserted by variable substitution).  (Or do other implementations
actually implement it as a token processing step?)


-David

-- 
𝄞   L. David Baron                         http://dbaron.org/   𝄂
𝄢   Mozilla                           http://www.mozilla.org/   𝄂
Received on Tuesday, 28 August 2012 21:28:41 UTC