Re: [css3-syntax] First draft of parser section completed from Simon Sapin on 2012-06-12 (www-style@w3.org from June 2012)

From: Simon Sapin <simon.sapin@kozea.fr>
Date: Tue, 12 Jun 2012 16:48:31 +0200
To: "Kang-Hao (Kenny) Lu" <kennyluck@csail.mit.edu>
CC: "Tab Atkins Jr." <jackalmage@gmail.com>, WWW Style <www-style@w3.org>
Message-ID: <4FD756BF.5050108@kozea.fr>
On 12/06/2012 08:52, Kang-Hao (Kenny) Lu wrote:
> (12/06/12 14:32), Simon Sapin wrote:
>> On 12/06/2012 08:15, Kang-Hao (Kenny) Lu wrote:
>>> I only feel strongly that we should document the difference between
>>> "Parse Error" and the CSS 2.1 "Core Grammar", so for whoever implements
>>> this grammar (e.g. tinycss) this is still trackable.
>>
>> I plan to update tinycss as soon as css3-syntax is stable enough.
>>
>> I realize this might be a breaking change for pretty much any usage of
>> tinycss, but I think that the project is still young enough to afford it.
>
> Can you provide some examples about this? Some objections to Core
> Grammar changes are based on the assumption that changing it is breaking
> tools, so it would be helpful to understand more about it.

Ok, here is an example:

tinycss 0.2 does not implement exactly the CSS 2.1 core grammar but 
something based on it. In particular, it has different token types for 
INTEGER and (non-integer) NUMBER.

In WeasyPrint 0.9 I have a function for each property that takes a list 
of tokens and parses the value. For example, the 'orphans' property 
checks that there is a single INTEGER token.

https://github.com/Kozea/WeasyPrint/blob/v0.9/weasyprint/css/validation.py#L652

Now if tinycss 0.3 changes to match css3-syntax, the INTEGER token type 
will disappear and NUMBER tokens will get an 'is_integer' flag. When 
WeasyPrint 0.9 gets such a token for 'orphans', it will incorrectly 
reject it as invalid. Therefore, tinycss 0.3 will be 
backward-incompatible with 0.2 and WeasyPrint will need to be adapted.
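
As a rough sketch of the breakage (the token classes and function names 
below are made up for illustration; they are not the actual tinycss or 
WeasyPrint code), assuming tokens are simple records with a 'type' 
attribute:

```python
from collections import namedtuple

# Hypothetical token shapes, for illustration only (not tinycss's classes).
OldToken = namedtuple("OldToken", "type value")
NewToken = namedtuple("NewToken", "type value is_integer")


def orphans_old(tokens):
    """0.2-style check: exactly one INTEGER token."""
    if len(tokens) == 1 and tokens[0].type == "INTEGER":
        return tokens[0].value
    return None  # value rejected as invalid


def orphans_new(tokens):
    """css3-syntax-style check: one NUMBER token flagged as an integer."""
    if len(tokens) == 1:
        token = tokens[0]
        if token.type == "NUMBER" and token.is_integer:
            return token.value
    return None  # value rejected as invalid


new_style = [NewToken("NUMBER", 2, True)]
print(orphans_old(new_style))  # None: the old check rejects a valid value
print(orphans_new(new_style))  # 2
```

The old check fails silently rather than crashing, which is why this 
kind of change is easy to miss downstream.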

This is not too much of a problem because I maintain both, but breaking 
stuff like this is not very nice to other users of tinycss. (I don’t 
know of any, but maybe they just don’t tell me.)


> Also, I have some questions out of curiosity.
>
> 1. What is the benefit of making the CSS 2.1 parser throw when there's
> an input not following the core grammar? Would giving warnings be a
> better approach?

If you give it a string, tinycss is never supposed to raise an 
exception. (This is the Python name for what I assume you mean by 
"throw".) If it does, it’s a bug.

Instead, it is supposed to return a Stylesheet object. In addition to 
rules (statements), this object has a list of "parse errors". On an 
invalid input (one that does not match the core grammar), tinycss should 
log a "parse error", read until the end of the declaration or rule (this 
is the specified error recovery behavior), and continue.

Maybe the "parse error" name is bad, because these are effectively 
warnings. Nothing fatal.
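
To illustrate the idea only (this is a toy, not how tinycss is written: 
the function name is invented, and splitting on ';' is a crude stand-in 
for real tokenization that ignores strings and nested blocks), a parser 
that collects errors and keeps going could look like this:

```python
def parse_declarations(text):
    """Toy 'name: value;' parser that records problems instead of raising."""
    declarations, errors = [], []
    # Splitting on ';' stands in for "read until the end of the declaration".
    for index, chunk in enumerate(text.split(";")):
        chunk = chunk.strip()
        if not chunk:
            continue  # empty declaration, nothing to report
        name, colon, value = chunk.partition(":")
        name, value = name.strip(), value.strip()
        if not colon or not name or not value:
            errors.append(
                "declaration %d: expected 'name: value', got %r" % (index, chunk)
            )
            continue  # recover: drop this declaration, keep parsing
        declarations.append((name, value))
    return declarations, errors


declarations, errors = parse_declarations("color: red; oops; margin: 0")
print(declarations)  # [('color', 'red'), ('margin', '0')]
print(len(errors))   # 1
```

The caller always gets a result back; the error list is informational 
and can be ignored or logged.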


> 2. Is it possible to build a parser on top of tinycss which never throws
> and follows the error handling rules of CSS 2.1 like a browser?

That is what it should do. And what it does, as far as I know. I use 
exceptions internally for flow control, but these are not supposed to 
interrupt the parser or to be propagated to the user.
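
A minimal sketch of that pattern, with hypothetical names (the 
exception class and both functions are invented here and do not come 
from tinycss): nested code raises to unwind, and the public entry point 
catches the exception and turns it into a logged error.

```python
class _ParseError(Exception):
    """Internal signal; hypothetical name, never meant to reach the caller."""


def _parse_length(text):
    # Raising lets deeply nested parsing code bail out in one step.
    if not text.endswith("px"):
        raise _ParseError("expected a px length, got %r" % text)
    try:
        return float(text[:-2])
    except ValueError:
        raise _ParseError("bad number in %r" % text)


def parse_lengths(values):
    """Public entry point: catches the internal exception at the boundary."""
    results, errors = [], []
    for value in values:
        try:
            results.append(_parse_length(value))
        except _ParseError as exc:
            errors.append(str(exc))  # logged, not propagated
    return results, errors


print(parse_lengths(["10px", "abc", "px"]))
```

Only _ParseError is caught at the boundary, so a genuine bug (any other 
exception type) would still surface instead of being hidden.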

If you have a specific input that causes tinycss to raise an exception 
that is propagated to the user, it is a bug. I am interested to know 
about these. (Reports can go to the github issue tracker, the WeasyPrint 
mailing list, or private email to me.)


Selectors, however, are another story. I took over maintenance of 
cssselect after extracting it from lxml, but I’m not the original 
author. cssselect has its own tokenizer and parser which (in the current 
version, 0.6.1) is broken in more ways than I know. The git version is 
better (with backslash-escapes actually implemented) but can still 
produce invalid XPath expressions (which in turn cause exceptions). This 
is work-in-progress.


> 3. Does tinycss, as it is, need a special conformance class so that it
> can be considered conforming (e.g. The HTML spec defines a bunch of
> non-browser conformance classes. It also says a UA can do the error
> handling *or* fail at the first error encountered.),

I don’t think that a special conformance class is needed. Actually 
css3-syntax already has this:

  # Certain points in the parsing algorithm are said to be parse errors.
  # The error handling for parse errors is well-defined: user agents
  # must either act as described below when encountering such problems,
  # or must abort processing at the first error that they encounter for
  # which they do not wish to apply the rules described below.

But I’m not sure that allowing implementations to stop at the first 
"error" is a good idea. At least this is not what I want to do in my 
implementation. Error recovery and fallback are pretty fundamental in CSS.


> since there are a
> bunch of test cases in the test suite which will just make tinycss throw?

Such test cases just mean that I haven’t spent enough time testing. Are 
they in the CSS 2.1 test suite?


By the way, we have a "test runner":
python -m weasyprint.tests.w3_test_suite.web

It’s not really polished, packaged or documented but it is better than 
nothing. Please ask if you’re interested and I can help.

-- 
Simon Sapin
Received on Tuesday, 12 June 2012 14:48:58 UTC