Re: [css3-syntax] First draft of parser section completed from Kang-Hao (Kenny) Lu on 2012-06-12 (www-style@w3.org from June 2012)

From: Kang-Hao (Kenny) Lu <kennyluck@csail.mit.edu>
Date: Tue, 12 Jun 2012 14:15:28 +0800
To: "Tab Atkins Jr." <jackalmage@gmail.com>
CC: WWW Style <www-style@w3.org>
Message-ID: <4FD6DE80.7020909@csail.mit.edu>
(12/06/12 9:12), Tab Atkins Jr. wrote:
> On Fri, Jun 8, 2012 at 9:01 PM, Kang-Hao (Kenny) Lu
>> 3. You seem to assume that bad-url doesn't open a "block". CSS 2.1 is a
>> bit vague on this (it doesn't say a bad-url contributes to an unbalanced
>> '(' or not), but since at least IE and Firefox implement this, this
>> should be marked as an issue.
> 
> [snip]
> 
> A quick search of our bugzilla revealed zero bugs about the
> block-parsing thing, so for now I'm going to assume that it's okay to
> do the simple thing and just handle this in the tokenizer, ignoring
> any blocks that get opened in the meantime.

That's quite interesting. So if I understand your change correctly,
equivalently you are appending something like {baduritail}, which is

 ([^\)\\]|{escape}|\\{nl})*(\)|\\)?

to each of {baduri1}, {baduri2} and {baduri3} in the flex grammar in
CSS 2.1, right? (Also, the positions of BAD_URI and URI would have to be
switched so that a valid URI isn't caught by this regexp.)
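
To make that concrete, here is a rough Python transcription of
{baduritail}. It is only a sketch: the names NL, UNICODE, ESCAPE and
BADURITAIL are mine, the {escape}, {unicode} and {nl} macros are expanded
as I read them in the CSS 2.1 grammar, and flex's case-insensitivity is
approximated by spelling out a-fA-F.

```python
import re

# Expansions of the CSS 2.1 flex macros, as I understand them (assumption).
NL = r"(?:\r\n|\n|\r|\f)"
UNICODE = r"\\[0-9a-fA-F]{1,6}(?:\r\n|[ \n\r\t\f])?"
ESCAPE = rf"(?:{UNICODE}|\\[^\n\r\f0-9a-fA-F])"

# ([^\)\\]|{escape}|\\{nl})*(\)|\\)?
BADURITAIL = re.compile(rf"(?:[^)\\]|{ESCAPE}|\\{NL})*(?:\)|\\)?")

# The tail swallows everything up to and including the first unescaped ')'.
tail = BADURITAIL.match("b \\) c) d").group(0)
print(repr(tail))  # 'b \\) c)'
```

The escaped "\)" stays inside the token, and the match stops right after
the first unescaped ')'.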

I don't have an opinion about this for now, but again, it would be nice
if css3-syntax had a list of things that have changed since CSS 2.1, so
that people who have read the flex grammar don't have to read all
the states to see what's changed. This growing list, as far as I can tell, is:

  * DASHMATCH and INCLUDES are gone.
  * BAD_URI is changed to be self-contained (including the ')').
  * a string ending with an EOF following backslash is now a STRING
    instead of a BAD_STRING.
  * BAD_STRING now contains the trailing newline character that makes
    it a BAD_STRING.
  * a newline following a backslash changed from a DELIM S to just a
    DELIM.

(There might be other differences only the editor knows!)
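
The two string-related items in the list above can be sketched roughly
as follows. This is a toy in Python, not the draft's state machine:
consume_string is a hypothetical helper, only "\n" is treated as a
newline, and treating an unterminated string at EOF as a STRING is my
assumption.

```python
def consume_string(src: str) -> tuple[str, str]:
    """Consume a string token from src, which must start with a quote.

    Returns (token_type, token_text).
    """
    quote = src[0]
    i = 1
    while i < len(src):
        ch = src[i]
        if ch == quote:
            return "STRING", src[: i + 1]
        if ch == "\n":
            # The offending newline is now part of the BAD_STRING itself.
            return "BAD_STRING", src[: i + 1]
        if ch == "\\":
            if i + 1 == len(src):
                # Backslash directly followed by EOF: STRING, not BAD_STRING.
                return "STRING", src
            i += 2  # skip the escaped character (or escaped newline)
            continue
        i += 1
    return "STRING", src  # assumption: EOF simply closes the string


print(consume_string('"abc\\'))       # ('STRING', '"abc\\')
print(consume_string('"abc\ndef"'))   # ('BAD_STRING', '"abc\n')
```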


Some nitpicks regarding the BAD_URI states:

1. In "URL-unquoted state", a newline following the backslash should
switch the state to "Bad-URL state".

2. In "URL-end state",

  # anything else
  #
  # This is a parse error. Switch to the bad-url state.

should be

  | This is a parse error. Switch to the bad-url state. Reconsume the
  | current input character.

for a case like "url(a \) )"
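
Here is a small Python sketch of why the reconsume matters. It is a
simplified model, not the draft's text: url_token_end is a hypothetical
helper, the input is whatever follows "url(", escapes in the unquoted
state are ignored, and I am assuming the bad-url state skips over a
backslash escape and ends at the first unescaped ')'.

```python
def url_token_end(text: str, reconsume: bool) -> int:
    """Return the index just past the url/bad-url token.

    text is the input following 'url('. reconsume selects whether the
    URL-end state hands the offending character to the bad-url state.
    """
    state = "url-unquoted"
    i = 0
    while i < len(text):
        ch = text[i]
        if state == "url-unquoted":
            if ch in " \t\n":
                state = "url-end"
            elif ch == ")":
                return i + 1  # a valid url token; not the case studied here
            i += 1
        elif state == "url-end":
            if ch in " \t\n":
                i += 1
            elif ch == ")":
                return i + 1
            else:
                # Parse error: switch to the bad-url state.
                state = "bad-url"
                if not reconsume:
                    i += 1  # the offending character is thrown away
        else:  # bad-url state (assumed behaviour)
            if ch == "\\" and i + 1 < len(text) and text[i + 1] != "\n":
                i += 2  # skip the escaped character
            elif ch == ")":
                return i + 1
            else:
                i += 1
    return len(text)


rest = "a \\) ) b"  # i.e. url(a \) ) b
print(repr(rest[url_token_end(rest, reconsume=True):]))   # ' b'
print(repr(rest[url_token_end(rest, reconsume=False):]))  # ' ) b'
```

Without the reconsume, the backslash is dropped, the bad-url state sees a
bare ')' and stops early, and a stray " )" leaks into the following
tokens.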

>> 4. In "3.5.13. Consume a block"
>>
>>  # whitespace token
>>  #
>>  # Do nothing.
>>
>> If you do this, UA can't tell if "calc(1+1)" is different from "calc(1 +
>> 1)", while the former is non-conforming. (Even if we end up allowing
>> optional spaces in calc(), there's still "attr(ns|name)").
> 
> Ah, yeah, you're right. :/ It's not strictly required from a parsing
> standpoint (if you didn't include some necessary whitespace, it would
> have tokenized differently)

Well, at least in the "1+1" case, spaces don't really matter in
tokenization.

> , but simpler rules for humans translate into slightly more
> complication on my side.  I'll preserve whitespace tokens, then.
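
To illustrate the calc() point, here is a toy tokenizer in Python. It is
a sketch under assumptions: tokenize and TOKEN_RE are made-up names,
numbers are unsigned as in the CSS 2.1 {num} macro, and everything that
is not whitespace, a function or a number (including ')') is lumped into
DELIM.

```python
import re

# Toy token grammar: whitespace, function, unsigned number, anything else.
TOKEN_RE = re.compile(
    r"(?P<S>[ \t\r\n\f]+)"
    r"|(?P<FUNCTION>-?[a-zA-Z_][a-zA-Z0-9_-]*\()"
    r"|(?P<NUMBER>[0-9]*\.[0-9]+|[0-9]+)"
    r"|(?P<DELIM>.)",
    re.S,
)


def tokenize(src: str, keep_whitespace: bool) -> list[tuple[str, str]]:
    """Return (type, text) pairs, optionally dropping whitespace tokens."""
    tokens = []
    for m in TOKEN_RE.finditer(src):
        if m.lastgroup == "S" and not keep_whitespace:
            continue  # the "Do nothing" behaviour for whitespace tokens
        tokens.append((m.lastgroup, m.group(0)))
    return tokens


# With whitespace dropped, the two inputs are indistinguishable.
print(tokenize("calc(1+1)", False) == tokenize("calc(1 + 1)", False))  # True
# With whitespace preserved, a later stage can tell them apart.
print(tokenize("calc(1+1)", True) == tokenize("calc(1 + 1)", True))    # False
```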

>> 7*. According to CSS 2.1, the '}' token triggers a "parse error" if it
>> is the first token in the Declaration-value mode
> 
> I'm not sure precisely how to decipher what CSS 2.1 wants us to do in
> this case (and with semicolon as first token in declaration-value), but
> browsers interoperably just drop the declaration.  It's currently
> undetectable whether this is because it's considered an overall
> violation of the Core Grammar, or because the empty value doesn't
> match any property's grammar.  I'm going with the latter for now,
> because it leaves the door open for the empty value for Variables.  I
> can change it if anyone feels strongly about it.

I only feel strongly that we should document the difference between
"Parse Error" and the CSS 2.1 "Core Grammar", so that whoever implements
this grammar (e.g. tinycss) can still track it. Otherwise, the
simplest thing, as I've been saying, is to drop the "Parse Error"
concept altogether (and say the difference is that the "Core Grammar" is
simply obsolete).

>> == non-technical feedback ==
>>
>> If we choose to drop the "parse error", a lot of branches in the
>> state machine can just merge into "anything else" and make some parts a
>> lot more readable.
>>
>> [1] http://lists.w3.org/Archives/Public/www-style/2010Aug/0435
> 
> Overall I'm fine with loosening some of the restrictions, such as the
> "unused" production cited in that email.  But I'd like to start by
> just transforming the current spec and fixing it to match reality when
> necessary.

Just to be clear, "unused" is allowed in a block in CSS 2.1. I am
not asking to make it looser.

The reality is that browsers don't implement the "Parse Error" thing. It
is just going to be quite confusing if a Web Console, when encountering
"width: <!--", says "violation of the core grammar in the value of
'width'" instead of just "unrecognized value in 'width'".

>>> I'm also interested in feedback about Issue 4, regarding how to
>>> specify the parser around at-rule block bodies.  What's the most
>>> useful way for me to specify that section, for someone implementing a
>>> CSS parser?
>>
>> I think in general at-rule-block parsing would just be Top-level parsing
>> with a special flag that says '}' ends the Top-level mode. It's likely
>> to be what's going on with a non-machine-generated parser (similarly,
>> @style value parsing will be "Declaration mode" parsing *without* such a
>> mode).
> 
> This is impossible - some at-rules allow declarations inside of them.
> Generally, we have to split at-rules into two camps - those whose
> insides are like top-level mode, and those whose insides are like
> declaration mode.

Indeed.



Cheers,
Kenny
Received on Tuesday, 12 June 2012 06:15:56 UTC