Re: [css-syntax] Dropping <number-token> representation, and its effects on <urange>

On Thu, Nov 20, 2014 at 5:24 AM, John Daggett <jdaggett@mozilla.com> wrote:
> Tab Atkins wrote:
>>> I can't say that I *like* this, but that's because I am
>>> philosophically not a fan of special tokenizer productions that
>>> only apply in specific grammar contexts -- can anyone think of a
>>> *practical* problem?  It's not any worse than unquoted url() in
>>> terms of code, it can't change the boundaries of a top-level
>>> construct, and the only other issue that comes to mind is that
>>> it'll make it harder to use <unicode-range-token> somewhere else
>>> in the future.  But I don't know that there *are* other uses, so.
>>
>> That requires a vastly more complicated change, switching the
>> Syntax module from being separate tokenizer/parser steps to being
>> integrated, with a lot more state being thrown around.  And it
>> doesn't help us if we ever want to use <urange> in another
>> property or context, which I think is plausible.
>
> Tab, the first line of your algorithm for handling <urange> sequences is [*]:
>
>   1. Skipping the first u token, concatenate the representations of
>      all the tokens in the production together (or, in the case of
>      <dimension-token>s, the representation followed by the unit).
>      Let this be text.
>
> Let's not kid ourselves here: that's basically taking the token soup
> that results from removing the UNICODE-RANGE token and saying "take
> these tokens and start over from scratch". Calling these "separate
> tokenizer/parser steps" is basically bogus, since your algorithm is
> effectively re-tokenizing the sequence within the parser.
>
> It would work just as well to say as part of selector parsing "if
> you see a unicode-range token, convert it to text and use this
> algorithm to come up with a selector". Both are hacks of equal standing;
> you won't be winning any design contests with either.

It's definitely arguable, but I don't think they're equal.  In
Selectors, the one token turns into three tokens, comprising pieces of
two compound selectors and a combinator.  That's really invasive from
a grammar POV; it means I basically have to do a preprocessing step
over the tokens before I can start actually matching a grammar against
them.
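
To make the comparison concrete, step 1 of the <urange> algorithm quoted
above can be sketched roughly as follows. This is only an illustration:
the Token class, its field names, and urange_text() are hypothetical, not
anything defined in the Syntax spec. It assumes each token still carries
its source representation, with a <dimension-token> keeping its unit
separately.

```python
from dataclasses import dataclass


@dataclass
class Token:
    # Hypothetical token shape: `representation` is the source text of the
    # token (for a dimension, only the numeric part); `unit` is used only
    # by dimension tokens.
    kind: str
    representation: str
    unit: str = ""


def urange_text(tokens):
    """Skip the leading `u` ident and concatenate the rest into text."""
    first = tokens[0] if tokens else None
    if first is None or first.kind != "ident" or first.representation.lower() != "u":
        raise ValueError("expected a leading 'u' ident token")
    parts = []
    for tok in tokens[1:]:
        parts.append(tok.representation)
        if tok.kind == "dimension":
            # For dimensions, the representation is followed by the unit.
            parts.append(tok.unit)
    return "".join(parts)


# `u+0-7f` tokenizes as ident(u), number(+0), dimension(-7, unit f).
tokens = [
    Token("ident", "u"),
    Token("number", "+0"),
    Token("dimension", "-7", "f"),
]
print(urange_text(tokens))  # +0-7f
```

The resulting text is then examined character by character, but the token
sequence handed to the grammar is unchanged; nothing is split or
re-inserted into the stream.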

> I think if we were actually trying to create an accurate
> representation of <urange> in a grammar form, it would look
> something like:
>
>   <urange> =
>     ['u' | 'U'] '+' [ [ <hex-value> ['-' <hex-value>]? ] |
>                       [ <hex-value>? '?'+ ] ]
>
> Here, <hex-value> would be a sequence of hexadecimal digits with the
> appropriate restrictions on number of digits and value range
> applied. I realize we don't have a clean way of representing
> <hex-value> as a sequence of CSS tokens currently and so the need
> for hacking.

Yes, that's what we'd do if we were defining grammars over codepoints.
But that's irrelevant, because we've lost the codepoints by the time
we apply grammars.
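
For illustration, a grammar over codepoints along the lines John describes
could be sketched as a regular expression. This is an assumption-laden
sketch, not spec text: URANGE_RE and parse_urange() are hypothetical names,
and the limits applied (at most six hex digits or wildcards, end no
greater than 0x10FFFF, start no greater than end) are my reading of the
"appropriate restrictions".

```python
import re

HEX = r"[0-9A-Fa-f]"

# u+<hex>[-<hex>]  or  u+<hex>?<one or more '?'>, matched over raw text.
URANGE_RE = re.compile(
    rf"[uU]\+(?:(?P<start>{HEX}{{1,6}})(?:-(?P<end>{HEX}{{1,6}}))?"
    rf"|(?P<wild>{HEX}{{0,5}}\?{{1,6}}))\Z"
)


def parse_urange(text):
    """Return (start, end) as integers, or None if text is not a valid range."""
    m = URANGE_RE.match(text)
    if m is None:
        return None
    wild = m.group("wild")
    if wild is not None:
        if len(wild) > 6:
            return None  # digits plus wildcards may not exceed six characters
        # Each '?' stands for any hex digit: lowest is 0, highest is F.
        start = int(wild.replace("?", "0"), 16)
        end = int(wild.replace("?", "F"), 16)
    else:
        start = int(m.group("start"), 16)
        end = int(m.group("end"), 16) if m.group("end") else start
    if end > 0x10FFFF or start > end:
        return None
    return (start, end)


print(parse_urange("U+0-7F"))  # (0, 127)
print(parse_urange("u+4??"))   # (1024, 1279)
```

This only works when the raw text is still available, which is exactly
what the parser no longer has once the input has been tokenized.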

> The new syntax for <urange> in the Syntax spec now is an ugly change
> but, meh, we can make it work.

kk

~TJ

Received on Thursday, 20 November 2014 16:39:22 UTC