Re: Forget what I said about whitespace from Bert Bos on 1997-12-19 (www-style@w3.org from December 1997)

From: Bert Bos <Bert.Bos@sophia.inria.fr>
Date: Fri, 19 Dec 1997 20:07:17 +0100 (MET)
To: neil@bigpic.com
Cc: www-style@w3.org
Message-Id: <199712191907.UAA08162@mygale.inria.fr>

Neil St.Laurent writes:
 > > I think you misread the specification. As far as I can see, both
 > > browsers follow the CSS1 specification quite well.
 > 
 > "WHITESPACE and COMMENT tokens do not occur in the grammar (to keep it 
 > readable), but any number of these tokens may appear anywhere. The 
 > content of these tokens (the matched text) doesn't matter, but their 
 > presence or absence may change the interpretation of some part of the 
 > style sheet.  For example, in CSS2 the WHITESPACE is significant in 
 > selectors."
 > 
 > That to me would seem to imply that WHITESPACE is of NO significance 
 > inside declaration sets and that "red  green" is treated like 
 > "redgreen".  This would be consistent with the assumed effort in the 
 > standard to ensure that no property tokens can accidentally form other 
 > tokens when in combination -- and the longest ones can always be 
 > checked first.

English is not a formal language, and what "significant" means is not
expressible as a logical formula. Spaces are not significant in the
sense that

   <TOKEN1><TOKEN2><TOKEN3>

is the same as

   <TOKEN1> <TOKEN2> <TOKEN3>

is the same as

   <TOKEN1>        <TOKEN2>               <TOKEN3>

Of course, you have to recognize the tokens first, and standard
practice (which is also what flex implements) is to take the longest
matching string in case of ambiguity. So

    redgreen"xyz"

is

    <TOKEN:redgreen><TOKEN:"xyz">

which is indeed the same as

    redgreen  "xyz"

It is clearly not the same as

    red green "xyz"

since this contains three tokens, not two. Whether you want to blame
it on the whitespace, or just on the fact that an identifier can only
consist of letters and digits, is a matter of interpretation.
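
As a minimal sketch of the longest-match rule described above, the
following Python tokenizer uses greedy regular expressions as a stand-in
for what flex does. The token patterns and the names TOKEN_RE and
tokenize are assumptions made for illustration; they are a
simplification, not the CSS1 scanner.

```python
import re

# Simplified token patterns (assumed for illustration, not the CSS1 scanner).
TOKEN_RE = re.compile(
    r'(?P<COMMENT>/\*.*?\*/)'            # comment: recognized, then dropped
    r'|(?P<WHITESPACE>\s+)'              # whitespace: recognized, then dropped
    r'|(?P<STRING>"[^"]*")'              # double-quoted string
    r'|(?P<IDENT>[A-Za-z][A-Za-z0-9]*)'  # identifier: greedy, so as long as possible
    r'|(?P<DELIM>.)',                    # any other single character
    re.DOTALL,
)


def tokenize(text):
    """Return (kind, text) pairs, leaving out whitespace and comments."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup
        if kind in ("WHITESPACE", "COMMENT"):
            continue
        tokens.append((kind, match.group()))
    return tokens


two = tokenize('redgreen"xyz"')
spaced = tokenize('redgreen  "xyz"')
three = tokenize('red green "xyz"')
```

Here two and spaced are the same list of two tokens, while three holds
three tokens, because the identifier pattern cannot extend across the
space.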

 > 
 > > So if you, as a CSS1 parser, encounter this, and you know who wrote
 > > it, please notify the author that there probably is an error in his
 > > style sheet, somewhere near "repeaturl".
 > 
 > What you are implying with these statements is that the YACC style 
 > grammar presented in the back of the draft is not for convenience, 
 > but is actually the specification for the standard.  In which case it 
 > isn't necessarily consistent with the wording of the standard, and 
 > there were errors in the macros.

The grammar is indeed normative. It says as much at the start of the
section.

Bugs can always occur. It is regrettable, but hopefully they are small
enough that people can guess what it should have been. We will publish
a list of errata soon.

 >  
 > > What will you do if there is a keyword that is a prefix of another:
 > > say we add "greenish", will you parse that as "green" + "ish"? Or a
 > > more practical example: it is likely that we will have a
 > > pseudo-class ":first" in CSS2, will that cause your parser to forget
 > > about the pseudo-elements ":first-letter" and ":first-line"?
 > 
 > Standard parsing practice is to check the longest string first, 
 > additionally you are using selectors as an example but they appear to 
 > have a very clear syntax.
 >  
 > > We are aware that the significance of whitespace in the selectors
 > > makes parsing slightly harder, but there is nothing special about
 > > spaces on the right hand side. Like in most other languages, a token
 > > is always as long as possible. Thus "repeaturl" is only one
 > > identifier, and not two (or three, or four, or...)
 > 
 > What you are indicating again is that WHITESPACE does have 
 > significance in that it breaks apart tokens.  If this is the 
 > assumption made in YACC then it should also be clearly stated inside 
 > the CSS2 draft -- which currently only says that WHITESPACE only has 
 > significance in selectors.

You can interpret it that way, if you want, but many other things
"break apart tokens" as well. The string

    red,

has two tokens, simply because there is no token that can consist of
all four characters. Same for

    red green

since there is no token that can include the space.

 >  
 > > That does indeed mean that you may have to put in some spaces when
 > > you write out a style sheet. Butwhatismorenaturalthanthat?
 > 
 > The spaces may be convenient, but for the most part they are not 
 > necessary at all -- with the exception of the flawed font rule with 
 > respect to face names (which really makes the whole WHITESPACE tokens 
 > having no significance point flawed).

That is not true. The declaration:

    font-family: times
                 new/* no space! */roman

has a value consisting of three tokens: "times", "new" and
"roman". The description of 'font-family' explains that in the case of
a value that consists of multiple tokens, the real font name is found
by concatenating them with a single space in between. Thus the font
name is actually "times new roman".

So, again, the spaces are not significant. The spaces that were there
are ignored, and other spaces that weren't there are inserted.
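
The concatenation rule above can be sketched in a few lines of Python.
This is only an illustration under simplifying assumptions: the helper
name font_family_name and the pattern _PIECE are hypothetical, the
identifier pattern is simplified, and the sketch handles a single
unquoted family name, not a comma-separated list or quoted names.

```python
import re

# Matches a comment, a run of whitespace, or an identifier.
# Only the identifier is captured; comments and whitespace merely
# separate tokens and are then thrown away.
_PIECE = re.compile(
    r'/\*.*?\*/|\s+|(?P<ident>[A-Za-z][A-Za-z0-9-]*)',
    re.DOTALL,
)


def font_family_name(value):
    """Join the identifier tokens of an unquoted family name with single spaces."""
    idents = [m.group("ident") for m in _PIECE.finditer(value) if m.group("ident")]
    return " ".join(idents)


name = font_family_name("times\n                 new/* no space! */roman")
```

With the declaration value from the example, name is "times new roman":
the newline and the run of spaces are dropped, and a space appears
between "new" and "roman" although the source had only a comment there.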

 > 
 > It is also interesting that CDO and CDC aren't just considered to be 
 > a part of the whitespace?

That is rather arbitrary. We wanted to put them in, so people could
more easily cut and paste style sheets between documents, but it was
not necessary to allow them everywhere. Showing them in the grammar
made them more visible than putting them in the scanner, so that is
what we did.

 > 
 > What I'm really saying is that the specification for the syntax rests 
 > on too many assumptions about existing tokenizers/lexers such as 
 > yacc/flex.
 > 
 > It is no real effort on our part to change our parser accordingly 
 > (which by the way we use a single parser that is not broken down 
 > into a tokenizer and lexer).
 > 
 > One other note (likely for clarity, but these small inconsistencies 
 > eventually add up to confusion) is that the grammar in 4.1.1 
 > doesn't use the previously stated TOKENS, and inside inserts single 
 > characters directly into the grammar.

The tokens would appear in error-handling rules, which are not shown
in the grammar. How you handle the skipping of incorrect rules depends
too much on the way you write your program, and trying to include that
in the grammar would obscure it too much.

 > 
 > One last point:
 > P { background: red; }
 > Since this incorrect declaration is so prevalent I think we should 
 > extend the syntax to allow a semicolon after the last property:value 
 > pair, otherwise the correct interpretation of this would not be to do 
 > anything to the background...

The semicolon is already legal. The grammar says that a declaration
may be empty, so, for example,

    background: red;;;

is legal, and has four declarations, but three of them are empty.
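
To make the count concrete, here is a small Python sketch that splits a
declaration block on semicolons. The function name split_declarations is
hypothetical, and the naive split is an assumption that only holds when
no semicolon appears inside a string or a url(); a real parser would
work on tokens.

```python
def split_declarations(block):
    """Split a declaration block on ';' and trim each piece.

    Naive on purpose: it does not account for semicolons inside
    strings, so it is only suitable for simple examples.
    """
    return [part.strip() for part in block.split(";")]


declarations = split_declarations("background: red;;;")
empty = [d for d in declarations if d == ""]
non_empty = [d for d in declarations if d != ""]
```

For this input, declarations has four entries, three of which are empty
strings, and non_empty holds only "background: red". The same split
applied to " background: red; " gives one real declaration followed by
one empty one, which is why the trailing semicolon is harmless.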



Bert
-- 
  Bert Bos                                ( W 3 C ) http://www.w3.org/
  http://www.w3.org/people/bos/                              W3C/INRIA
  bert@w3.org                             2004 Rt des Lucioles / BP 93
  +33 (0)4 93 65 76 92            06902 Sophia Antipolis Cedex, France
  +33 (0)4 92 38 76 92 (<--- after 5 Jan 1998)
Received on Friday, 19 December 1997 14:08:26 UTC