Re: [css3-values] calc(), whitespace, and DIMENSION tokens from Paul Duffin on 2008-03-25 (www-style@w3.org from March 2008)

From: Paul Duffin <pduffin@volantis.com>
Date: Tue, 25 Mar 2008 12:26:07 -0600 (MDT)
To: fantasai <fantasai.lists@inkedblade.net>
Cc: www-style@w3.org
Message-ID: <1209128255.67211206469567640.JavaMail.root@zimbra.volantis.com>

fantasai wrote:
> Paul Duffin wrote:
>>
>> Not too much more complex than allowing dimensions but makes it much 
>> easier to specify, implement, and author. IMNSHO having to type a 
>> couple of extra characters is less onerous than having to remember 
>> lots of different rules about where white space is necessary and where 
>> it is not.
> 
> This is totally inconsistent with existing CSS syntax. Requiring whitespace
> between tokens is less of a burden than defining a new syntax for lengths.
> 

The new syntax would only be needed within expressions. The issue with 
the grammar is not that it requires whitespace between tokens but that 
the same semantic construct can have several different tokenizations 
depending on whether whitespace is present.

e.g. within the nth-child() function
2n-1      <DIMENSION>
2n -1     <DIMENSION> <NUMBER>
2 n -1    <NUMBER> <IDENTIFIER> <NUMBER>

All of these are semantically the same and differ only in their use of 
whitespace (whether they are actually allowed at the moment is another 
matter). The nth-child() function is relatively simple, but if arbitrary 
expressions are allowed then the problem will only get worse, e.g.

2 n - 1   <NUMBER> <IDENTIFIER> <OPERATOR> <NUMBER>
2n - 1    <DIMENSION> <OPERATOR> <NUMBER>
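
To make the listings above reproducible, here is a rough sketch of a
toy tokenizer in Python. The token names are the ones used in this
message; the regular expressions are simplified assumptions of mine
(signed integers only, ASCII identifiers that may contain "-") and are
not the actual CSS token definitions. The function name tokenize() is
likewise hypothetical.

```python
import re

# Assumed, simplified token shapes; NOT the real CSS grammar.
NUM = r"[+-]?\d+"                       # optionally signed integer
IDENT = r"-?[A-Za-z_][A-Za-z0-9_-]*"    # identifier, may contain "-"

# Alternatives are tried in order, so DIMENSION (number immediately
# followed by an identifier) wins over a bare NUMBER.
TOKEN_RE = re.compile(
    rf"(?P<DIMENSION>{NUM}{IDENT})"
    rf"|(?P<NUMBER>{NUM})"
    rf"|(?P<IDENTIFIER>{IDENT})"
    r"|(?P<OPERATOR>[-+*/])"
    r"|(?P<WS>\s+)"
)

def tokenize(text):
    """Return a list of (token type, token text) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}: {text[pos]!r}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

if __name__ == "__main__":
    for source in ["2n-1", "2n -1", "2 n -1", "2 n - 1", "2n - 1"]:
        kinds = " ".join(f"<{kind}>" for kind, _ in tokenize(source))
        print(f"{source:<8} {kinds}")
```

Under these assumptions the five spellings come out as five different
token sequences, even though an author would read all of them as the
same expression.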

I do not know of any modern language whose tokenization behaves this
way (Fortran does have something similar but its syntax is hardly
modern). In fact I think that is one reason why most modern languages
forbid identifiers from starting with a digit (which is exactly what a
<DIMENSION> such as 2n-1 is: a single token that starts with a digit
and continues like an identifier).

The purpose of tokenization is to simplify the input in preparation for 
syntax analysis. As it stands the tokenization does the opposite: it 
increases the number of token combinations the grammar has to accept 
(even now I am not sure that I have enumerated them all), which makes 
the grammar much more complex. In fact it is quite possible that in 
some cases the above tokenization would result in ambiguous grammars.
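
As an illustration of the extra work this pushes onto the syntax
analysis, here is a hedged sketch in Python of a function that recovers
the pair (a, b) from "an+b" style input. It takes already-tokenized
input as (type, text) pairs using the token names from this message.
The function name an_plus_b() and the exact set of accepted shapes are
my own assumptions; a real grammar would need more cases still.

```python
import re

def an_plus_b(tokens):
    """Return (a, b) for the token shapes discussed above, else raise ValueError."""
    kinds = [kind for kind, _ in tokens]
    texts = [text for _, text in tokens]

    def coefficient(dimension):
        # A DIMENSION such as "2n" carries the coefficient before the "n".
        m = re.fullmatch(r"([+-]?\d+)n", dimension)
        if m is None:
            raise ValueError(f"not of the form <a>n: {dimension!r}")
        return int(m.group(1))

    def signed(op, number):
        if op not in ("+", "-"):
            raise ValueError(f"unsupported operator: {op!r}")
        return -int(number) if op == "-" else int(number)

    if kinds == ["DIMENSION"]:
        # "2n-1": the offset is hidden inside the unit and must be re-scanned.
        m = re.fullmatch(r"([+-]?\d+)n(-\d+)", texts[0])
        if m is None:
            raise ValueError(f"cannot split dimension: {texts[0]!r}")
        return int(m.group(1)), int(m.group(2))
    if kinds == ["DIMENSION", "NUMBER"]:
        # "2n -1"
        return coefficient(texts[0]), int(texts[1])
    if kinds == ["NUMBER", "IDENTIFIER", "NUMBER"] and texts[1] == "n":
        # "2 n -1"
        return int(texts[0]), int(texts[2])
    if kinds == ["DIMENSION", "OPERATOR", "NUMBER"]:
        # "2n - 1"
        return coefficient(texts[0]), signed(texts[1], texts[2])
    if kinds == ["NUMBER", "IDENTIFIER", "OPERATOR", "NUMBER"] and texts[1] == "n":
        # "2 n - 1"
        return int(texts[0]), signed(texts[2], texts[3])
    raise ValueError(f"unrecognized token sequence: {kinds}")

if __name__ == "__main__":
    print(an_plus_b([("DIMENSION", "2n-1")]))
    print(an_plus_b([("NUMBER", "2"), ("IDENTIFIER", "n"),
                     ("OPERATOR", "-"), ("NUMBER", "1")]))
```

Five branches are needed for what is a single construct, and one of
them has to re-scan the text of a token the tokenizer had already
produced.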

>> I am concerned that unless the syntax is clearly defined in a 
>> recognized format, e.g. BNF, then there will be all sorts of 
>> ambiguities that will be resolved by each implementation in different 
>> ways.
> 
> CSS syntax is usually defined in both prose and grammar productions.
> The ambiguities usually arise from the grammar not being precise
> enough to reflect constraints from the prose.
> 

Syntax should be defined first and foremost using standard grammar and
tokenizer mechanisms that can be automatically checked for ambiguity.
Prose should only be used to add constraints in exceptional 
circumstances. The more you rely on prose, the more ambiguities (and hence
arguments) there will be about how it is supposed to behave, with a 
corresponding detrimental impact on implementations.

My reference to compatibility with XPath was simply to raise the point that
XPath has already defined an expression language that can deal with 
identifiers containing "-"s, and CSS should learn from that.

I agree that CSS must be easy to author, but that is about more than just
the number of characters authors have to type. They must also be able
to understand how they are supposed to write it, and have an expectation
that it will work across all browsers. These are just as important as the
former and are adversely impacted by a complex and ambiguous grammar.

Received on Tuesday, 25 March 2008 18:26:45 UTC