Re: [css3-syntax] Reviving the spec, starting with the parser from Tab Atkins Jr. on 2012-04-12 (www-style@w3.org from April 2012)

From: Tab Atkins Jr. <jackalmage@gmail.com>
Date: Thu, 12 Apr 2012 08:22:31 -0700
To: Simon Sapin <simon.sapin@kozea.fr>
Cc: www-style list <www-style@w3.org>
Message-ID: <CAAWBYDBh8qSeZ_CP3hcdPZDkrXG9LKbAK9tTCWPkxcCTAsQf+A@mail.gmail.com>

On Thu, Apr 12, 2012 at 5:54 AM, Simon Sapin <simon.sapin@kozea.fr> wrote:
> Maybe it should be clarified near the start that whenever the rest of the
> text says "character", it really means "codepoint". The tokenizer and parser
> never need to know about Unicode normalization, combining characters, and
> other gory details of this kind.

This is expressed in the "overview of the parsing model" section,
which explicitly defines the input as a stream of Unicode code
points.  The rest of the algorithm consistently refers to code points
as well.


> Is there a reason to handle \r and \f differently? (U+000D and
> U+000C) Why not convert \f to \n, just like it is done for \r?

As far as I can tell, this would be fine to do.  \f is never treated
differently than \n (they both fall into the "newline" category, which
is the only way I ever mention them).
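
For illustration only, the proposed normalization could be sketched as
a preprocessing pass like the one below. The function name is
hypothetical, and folding a \r\n pair into a single \n is an
assumption that is not discussed in this thread.

```python
def normalize_newlines(text: str) -> str:
    """Fold every newline-like code point into U+000A.

    Hypothetical helper, not taken from the draft. Treating CR LF as
    one newline is an assumption made for this sketch.
    """
    # Handle the CR LF pair first so it yields one newline, not two.
    text = text.replace("\r\n", "\n")
    # Then fold a lone CR (U+000D) and FF (U+000C) into LF.
    return text.replace("\r", "\n").replace("\f", "\n")
```

With this pass in place, the tokenizer itself would only ever see \n
in the "newline" category.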


> There are a few mentions of "HTML DOM", "document.write()" and "insertion
> point". These do not seem necessary.

Yeah, the opening sections were just copied from HTML and lightly
modified.  I'll need to rewrite them more.


> The tokenizer sometimes looks ahead to decide what to do (eg. is '+'
> followed by a digit?), sometimes has more states (eg. hash state vs.
> hash-rest state). These two techniques look similar. Or are they not equivalent?
> When both would work, is there a general principle to choose which to use
> when writing this spec?

I'm probably not 100% consistent, but I tried to always prefer using
more states, and to lean on lookahead only when it was necessary to
avoid emitting more than one token before returning to the "data"
state.

The two techniques are equivalent if you don't care about receiving
multiple tokens per call, or if you're okay with tracking additional
state to remember where you were in the parser, or if you're not doing
a piece-at-a-time scanner and are just tokenizing the whole thing at
once.  From what I understand, though, most implementations write
their tokenizer as a scanner that returns one token at a time, so I
tried to write the tokenizer to favor that style.
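
A toy sketch of the one-token-per-call style, using lookahead for the
'+' case mentioned above. The class, method, and token names are all
hypothetical, and the token set is far smaller than the real grammar.

```python
from typing import Tuple

Token = Tuple[str, str]  # (kind, text); hypothetical token shape


def is_digit(ch: str) -> bool:
    # ASCII digits only; the empty string (end of input) is not a digit.
    return len(ch) == 1 and "0" <= ch <= "9"


class Scanner:
    """Returns exactly one token per next_token() call."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.pos = 0

    def peek(self, offset: int = 0) -> str:
        # Look at a code point without consuming it; "" means end of input.
        i = self.pos + offset
        return self.text[i] if i < len(self.text) else ""

    def next_token(self) -> Token:
        ch = self.peek()
        if ch == "":
            return ("EOF", "")
        # Lookahead: '+' begins a number only if a digit follows.
        # Deciding this before consuming '+' means the scanner never
        # has to emit a DELIM and a second token in the same call.
        if is_digit(ch) or (ch == "+" and is_digit(self.peek(1))):
            start = self.pos
            self.pos += 1
            while is_digit(self.peek()):
                self.pos += 1
            return ("NUMBER", self.text[start:self.pos])
        self.pos += 1
        return ("DELIM", ch)
```

Calling next_token() repeatedly on "+12+a" yields a NUMBER "+12", then
a DELIM "+", then a DELIM "a", then EOF.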


> Are implementations required to actually have an internal state machine
> with the specified states, or can they do anything as long as they are
> equivalent? (Produce the same tokens on a given input.)

They just need to be black-box equivalent.  I'll make that clearer.
(I forget that CSS doesn't have that as a general policy, like HTML
does.)


> Backslash-unicode escapes and unicode ranges contain hexadecimal values for
> codepoints. What should happen when we parse a value that is outside the
> range of codepoints supported by the platform?
> css3-fonts (the only usage of unicode ranges that I know of) says that
> ranges are clipped.
> css21 mentions using U+FFFD or something similar for out-of-range escapes.
> Both of these behaviors should be defined in css3-syntax.

Hmm, I think I can include that unicode-range behavior.

Also, it appears I forgot that unicode-ranges can contain question
marks.  I'll need to handle that.
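
As a rough sketch of both points, assuming question marks are trailing
wildcards for any hex digit and that clipping means truncating at
U+10FFFF. Function names are hypothetical, and the input is assumed to
be already validated (hex digits followed by optional question marks).

```python
from typing import Optional, Tuple

MAX_CODE_POINT = 0x10FFFF  # highest Unicode code point


def expand_wildcards(text: str) -> Tuple[int, int]:
    """Turn e.g. "4??" into (0x400, 0x4FF).

    Each '?' stands for any hex digit, so the lowest match replaces
    it with 0 and the highest replaces it with F.
    """
    start = int(text.replace("?", "0"), 16)
    end = int(text.replace("?", "F"), 16)
    return (start, end)


def clip_range(start: int, end: int) -> Optional[Tuple[int, int]]:
    """Clip a range to Unicode; None if nothing valid remains."""
    if start > end or start > MAX_CODE_POINT:
        return None
    return (start, min(end, MAX_CODE_POINT))
```

For example, clip_range(*expand_wildcards("10????")) gives
(0x100000, 0x10FFFF).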

You're right about out-of-range escapes.  I'll make that change.
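
A minimal sketch of the out-of-range escape behavior, assuming the
replacement is U+FFFD as css21 suggests. The function name is
hypothetical, the argument is assumed to be the already-collected hex
digits, and surrogates and \0 are deliberately not special-cased here.

```python
REPLACEMENT = "\uFFFD"     # REPLACEMENT CHARACTER
MAX_CODE_POINT = 0x10FFFF  # highest Unicode code point


def decode_hex_escape(hex_digits: str) -> str:
    """Map the hex digits of a backslash escape to one character.

    Values above U+10FFFF become U+FFFD instead of raising an error.
    """
    value = int(hex_digits, 16)
    if value > MAX_CODE_POINT:
        return REPLACEMENT
    return chr(value)
```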

> I also suggest making the supported range implementation-dependent. The
> current highest unicode codepoint is 0x10ffff, but some "broken" platforms
> only support up to 0xffff (ie. only inside the BMP).

CSS doesn't currently allow platforms to support only part of Unicode.
Do you have specific examples of platforms in use that are broken in
this way that we should support?


> Also, \0 sometimes has a special meaning and cannot be used in the middle
> of a string. This could be expressed by having the supported range start at
> U+0001 instead of U+0000.

I just tested Chrome, Firefox, and IE 8, and only Chrome handles a \0
in a string correctly.  Firefox bails and pretends I was trying to
escape a '0', and IE is just *weird* - it emits a replacement
character and then turns the remainder of the string into replacement
characters too.

Implementors, what do you think?  Is this a simple bug you could fix,
or is it better to replace NULLs with something else, like U+FFFD?

~TJ
Received on Thursday, 12 April 2012 15:23:24 UTC