Re: [css3-syntax] Reviving the spec, starting with the parser from Simon Sapin on 2012-04-12 (www-style@w3.org from April 2012)

From: Simon Sapin <simon.sapin@kozea.fr>
Date: Thu, 12 Apr 2012 14:54:34 +0200
To: "Tab Atkins Jr." <jackalmage@gmail.com>
CC: www-style list <www-style@w3.org>
Message-ID: <4F86D08A.5020500@kozea.fr>

On 12/04/2012 02:33, Tab Atkins Jr. wrote:
> I've currently gotten v1.0 of the tokenizer finished, temporarily
> stored at <http://dev.w3.org/csswg/css3-syntax/parsing.html>.  I'll
> start on the tree-builder next that actually produces stylesheets.


A few comments:


Maybe it should be clarified near the start that whenever the rest of
the text says "character", it really means "codepoint". The tokenizer
and parser never need to know about Unicode normalization, combining
characters, or other gory details of that kind.


Is there a reason to handle \r and \f differently? (U+000D and U+000C,
respectively.) Why not convert \f to \n, just as is done for \r?
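
To make the suggestion concrete, here is a minimal sketch of a
preprocessing step that treats \f like \r. The function name
preprocess is made up for illustration, and mapping \r\n to a single
\n is my assumption about what the draft intends, not a quote from it.

```python
import re

# Assumption: \r\n, a lone \r and \f all become a single \n.
# This is the behaviour proposed in this message, not necessarily
# what the current draft specifies.
_NEWLINES = re.compile(r"\r\n|\r|\f")


def preprocess(css_text):
    """Return css_text with every newline-like sequence normalized to \\n."""
    return _NEWLINES.sub("\n", css_text)
```

With this in place the tokenizer proper would only ever see \n, so it
would not need separate cases for the other newline characters.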


There are a few mentions of "HTML DOM", "document.write()" and 
"insertion point". These do not seem necessary.


The tokenizer sometimes looks ahead to decide what to do (e.g. is '+'
followed by a digit?), and sometimes has more states (e.g. hash state
vs. hash-rest state). These two techniques look similar. Or are they
not equivalent? When both would work, is there a general principle for
choosing which one to use when writing this spec?
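
To illustrate what I mean, here is a toy sketch of the same hash rule
written both ways. It is not the draft's algorithm: the token names,
the NAME_CHARS set and both function names are simplifications I made
up, and everything that is not a hash becomes a one-character DELIM
token. The state names are only loosely borrowed from the draft.

```python
NAME_CHARS = set(
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_-"
)


def tokenize_with_states(text):
    """Extra states, no lookahead: data, hash, hash-rest."""
    tokens = []
    state = "data"
    name = ""
    i = 0
    while i <= len(text):
        ch = text[i] if i < len(text) else None  # None stands for EOF
        if state == "data":
            if ch == "#":
                state = "hash"
            elif ch is not None:
                tokens.append(("DELIM", ch))
            i += 1
        elif state == "hash":
            if ch in NAME_CHARS:
                name = ch
                state = "hash-rest"
                i += 1
            else:
                # Not a hash after all: emit '#' and reconsume ch.
                tokens.append(("DELIM", "#"))
                state = "data"
        else:  # hash-rest
            if ch in NAME_CHARS:
                name += ch
                i += 1
            else:
                # End of the name: emit the hash and reconsume ch.
                tokens.append(("HASH", name))
                state = "data"
    return tokens


def tokenize_with_lookahead(text):
    """Single state, peeking one codepoint ahead after '#'."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "#" and i + 1 < len(text) and text[i + 1] in NAME_CHARS:
            j = i + 1
            while j < len(text) and text[j] in NAME_CHARS:
                j += 1
            tokens.append(("HASH", text[i + 1:j]))
            i = j
        else:
            tokens.append(("DELIM", ch))
            i += 1
    return tokens
```

In this toy case the two functions produce the same tokens for the
same input, which is the kind of equivalence I am asking about.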


Are implementations required to actually have an internal state
machine with the specified states, or can they do anything as long as
the result is equivalent (i.e. the same tokens are produced for a
given input)?


Backslash-unicode escapes and unicode ranges contain hexadecimal values 
for codepoints. What should happen when we parse a value that is outside 
the range of codepoints supported by the platform?
css3-fonts (the only usage of unicode ranges that I know of) says that 
ranges are clipped.
css21 mentions using U+FFFD or something similar for out-of-range escapes.
Both of these behaviors should be defined in css3-syntax.

I also suggest making the supported range implementation-dependent. The
current highest Unicode codepoint is 0x10ffff, but some "broken"
platforms only support up to 0xffff (i.e. only the BMP).
Also, \0 sometimes has a special meaning and cannot be used in the
middle of a string. This could be expressed by having the supported
range start at U+0001 instead of U+0000.
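
As a rough sketch of the two behaviors, with the supported range as a
parameter: the function names, the min_cp/max_cp parameters and the
choice to return None for a range that lies entirely outside the
supported range are all my own assumptions. Surrogates are not treated
specially here, and max_cp is assumed to be at most 0x10FFFF.

```python
REPLACEMENT = "\uFFFD"


def decode_escape(hex_digits, min_cp=0x0001, max_cp=0x10FFFF):
    """Decode the hex digits of a backslash escape to one codepoint.

    Values outside [min_cp, max_cp] give U+FFFD, along the lines of
    what css21 mentions for out-of-range escapes.
    """
    cp = int(hex_digits, 16)
    if cp < min_cp or cp > max_cp:
        return REPLACEMENT
    return chr(cp)


def clip_range(start, end, min_cp=0x0001, max_cp=0x10FFFF):
    """Clip a unicode range to the supported range.

    Returns a (start, end) tuple, or None if nothing is left after
    clipping (an assumption; css3-fonts only says ranges are clipped).
    """
    start = max(start, min_cp)
    end = min(end, max_cp)
    if start > end:
        return None
    return (start, end)
```

A BMP-only platform would call these with max_cp=0xFFFF, so that
decode_escape("1F600", max_cp=0xFFFF) falls back to U+FFFD instead of
failing.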

-- 
Simon Sapin

Received on Thursday, 12 April 2012 12:55:05 UTC