- From: Simon Sapin <simon.sapin@kozea.fr>
- Date: Thu, 12 Apr 2012 14:54:34 +0200
- To: "Tab Atkins Jr." <jackalmage@gmail.com>
- CC: www-style list <www-style@w3.org>
On 12/04/2012 02:33, Tab Atkins Jr. wrote:
> I've currently gotten v1.0 of the tokenizer finished, temporarily
> stored at <http://dev.w3.org/csswg/css3-syntax/parsing.html>. I'll
> start on the tree-builder next that actually produces stylesheets.

A few comments:

Maybe it should be clarified near the start that whenever the rest of the text says "character", it really means "codepoint". The tokenizer and parser never need to know about Unicode normalization, combining characters, or these kinds of gory details.

Is there a reason to handle \r and \f (U+000D and U+000C) differently? Why not convert \f to \n, just like it is done for \r?

There are a few mentions of "HTML DOM", "document.write()" and "insertion point". These do not seem necessary.

The tokenizer sometimes looks ahead to decide what to do (e.g. is '+' followed by a digit?), and sometimes has more states (e.g. hash state vs. hash-rest state). These two techniques look similar. Or are they not equivalent? When both would work, is there a general principle to choose which to use when writing this spec?

Are implementations required to actually have an internal state machine with the specified states, or can they do anything as long as the result is equivalent (i.e. they produce the same tokens on a given input)?

Backslash-unicode escapes and unicode ranges contain hexadecimal values for codepoints. What should happen when we parse a value that is outside the range of codepoints supported by the platform? css3-fonts (the only usage of unicode ranges that I know of) says that ranges are clipped. css21 mentions using U+FFFD or something similar for out-of-range escapes. Both of these behaviors should be defined in css3-syntax.

I also suggest making the supported range implementation-dependent. The current highest Unicode codepoint is 0x10ffff, but some "broken" platforms only support up to 0xffff (i.e. only inside the BMP). Also, \0 sometimes has a special meaning and cannot be used in the middle of a string.
This could be expressed by having the supported range start at U+0001 instead of U+0000.

-- 
Simon Sapin
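
For illustration only, here is a rough Python sketch of the two out-of-range behaviors discussed in the message: replacing an out-of-range escape with U+FFFD, and clipping a unicode range to what the platform supports. All names (MIN_CODEPOINT, MAX_CODEPOINT, decode_escape, clip_unicode_range) are hypothetical and not taken from css3-syntax, and starting the range at U+0001 is just the suggestion made above, not specified behavior.

```python
# Hypothetical sketch; none of these names come from css3-syntax.
MIN_CODEPOINT = 0x1        # assumption: U+0000 excluded, as suggested above
MAX_CODEPOINT = 0x10FFFF   # a BMP-only platform would use 0xFFFF here
REPLACEMENT = "\uFFFD"     # U+FFFD REPLACEMENT CHARACTER


def decode_escape(hex_digits):
    """Return the character for the hex digits of a backslash escape.

    Values outside the supported range become U+FFFD.
    Surrogate codepoints are not special-cased in this sketch.
    """
    codepoint = int(hex_digits, 16)
    if MIN_CODEPOINT <= codepoint <= MAX_CODEPOINT:
        return chr(codepoint)
    return REPLACEMENT


def clip_unicode_range(start, end):
    """Clip an inclusive codepoint range to the supported range.

    Returns a (start, end) tuple, or None if nothing is left after clipping.
    """
    start = max(start, MIN_CODEPOINT)
    end = min(end, MAX_CODEPOINT)
    if start > end:
        return None
    return (start, end)
```

With these assumptions, decode_escape("41") gives "A", while decode_escape("110000") and decode_escape("0") both give U+FFFD. A BMP-only platform would only need to change MAX_CODEPOINT.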
Received on Thursday, 12 April 2012 12:55:05 UTC