Re: Selector parsing: It's easy to hit unexpected unicode-range tokens

On Mon, Jun 30, 2014 at 9:23 AM, Boris Zbarsky <bzbarsky@mit.edu> wrote:
> On 6/30/14, 11:12 AM, Simon Sapin wrote:
>>
>> On 30/06/14 15:34, Boris Zbarsky wrote:
>>>
>>> It seems to me like either we should not have a separate unicode-range
>>> token and instead handle unicode ranges on the parser level or we should
>>> have some sort of special token reprocessing logic in the selector
>>> parser.  My preference is very much for the former.
>>
>>
>> I think we can do the former with a definition similar to this
>> definition of <An+B> (the argument to :nth-child())
>>
>> http://dev.w3.org/csswg/css-syntax/#the-anb-type
>>
>> It’s ugly, but it’s well-defined and it seems to be the "least worst" we
>> can do here.
>
> I guess there is a third option too: tokenizer modes, such that u+a would be
> tokenized differently in different contexts.  I'm not sure how happy we are
> with that idea.

I'm not particularly happy with that idea; it requires either
intertwining the tokenizer and parser, or maintaining the original
text precisely enough during tokenization that it can be re-tokenized
with a different tokenizer during parsing.

I'm fine with dropping unicode-range as a token and just recognizing
it specially like we do with <an+b>.  It's a little complex, but no
more so than an+b is.  Philosophically, it occupies a similar space to
an+b - it's a weird special-purpose token that is only used for one
specific purpose, and is used in carefully controlled contexts (that
is, it's not generally mixed in with a bunch of other tokens in the
grammars where it's used).  I prefer making these have an ugly
token-based definition rather than continually running into these
weird special cases that we didn't consider previously.
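
To make the difference concrete, here is a rough Python sketch, a toy
and not spec text.  The names tokenize, parse_urange, URANGE and IDENT
are made up for this example, the tokenizer only knows idents,
whitespace and single-character delims, and the range pattern is
simplified.  It shows the selector "u+a" being swallowed by a dedicated
unicode-range token, and the same range being recovered at the parser
level from ordinary tokens instead.

```python
import re

# Simplified range shape (an assumption, looser than real CSS):
# "u+", then 1-6 hex digits or "?", then an optional "-end".
URANGE = re.compile(r"[uU]\+([0-9a-fA-F?]{1,6})(?:-([0-9a-fA-F]{1,6}))?")
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_-]*")


def tokenize(css, with_urange_token):
    """Toy tokenizer; with_urange_token toggles the dedicated token."""
    tokens, i = [], 0
    while i < len(css):
        m = URANGE.match(css, i) if with_urange_token else None
        if m:
            tokens.append(("unicode-range", m.group(0)))
        else:
            m = IDENT.match(css, i)
            if m:
                tokens.append(("ident", m.group(0)))
        if m:
            i = m.end()
        else:
            # Anything else is whitespace or a one-character delim.
            kind = "whitespace" if css[i].isspace() else "delim"
            tokens.append((kind, css[i]))
            i += 1
    return tokens


def parse_urange(tokens):
    """Parser-level recognition: rejoin the token text and match the
    range shape only where the grammar actually expects a range."""
    text = "".join(value for _, value in tokens)
    m = URANGE.fullmatch(text)
    if m is None:
        return None
    first, last = m.group(1), m.group(2)
    if last is None:
        # "?" wildcards expand to the lowest and highest matching values.
        return int(first.replace("?", "0"), 16), int(first.replace("?", "F"), 16)
    if "?" in first:
        return None  # wildcards cannot be combined with an explicit end
    return int(first, 16), int(last, 16)


# In a selector, the dedicated token eats what should be "u", "+", "a":
print(tokenize("u+a", True))    # [('unicode-range', 'u+a')]
print(tokenize("u+a", False))   # [('ident', 'u'), ('delim', '+'), ('ident', 'a')]
# In a unicode-range context, the parser can still recover the range:
print(parse_urange(tokenize("u+a", False)))  # (10, 10)
```

With the dedicated token gone, the selector parser sees a plain
type selector, combinator, type selector sequence, and only the one
place that wants a range pays for the extra complexity.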

~TJ

Received on Monday, 30 June 2014 21:27:15 UTC