Re: [css-syntax] Removed <unicode-range-token>, please review from Tab Atkins Jr. on 2014-11-17 (www-style@w3.org from November 2014)

From: Tab Atkins Jr. <jackalmage@gmail.com>
Date: Mon, 17 Nov 2014 13:24:06 -0800
To: "L. David Baron" <dbaron@dbaron.org>
Cc: www-style list <www-style@w3.org>
Message-ID: <CAAWBYDCncH8XJ737o4Mj0Mf1oVJj5qfH4Zh9bjTC_L+By3=Exg@mail.gmail.com>

On Mon, Nov 17, 2014 at 11:31 AM, Tab Atkins Jr. <jackalmage@gmail.com> wrote:
> On Thu, Nov 13, 2014 at 5:27 PM, L. David Baron <dbaron@dbaron.org> wrote:
>> I'd prefer to leave which values are valid syntax and which aren't
>> the way they are; I don't see the point in introducing compatibility
>> risk without a good reason.  Unless, that is, implementation
>> behavior doesn't actually match the current spec.
>
> While I don't *generally* agree that this is necessary, looking over
> the complications of handling the syntax properly when I take scinot
> into account, I'm going to switch to an approach that makes "match the
> current syntax" easy to do.
>
> (I'm just going to claim all the token combinations that show up,
> regardless of what's in them, then concatenate and re-parse their
> representations.  This makes it much easier to check the correct number of
> characters in each form. This makes <urange> a bit wider in
> syntax-space than I'd like, but it's not a big deal, and, like <anb>,
> you just have to be careful when using <urange> in new syntaxes in the
> future.)

And done.  Review appreciated; I ended up taking the old
<unicode-range-token> spec text and just generalizing it to be
error-detecting.  A valid <urange> now matches exactly the syntax of
the old <unicode-range-token>.

Some methodology information: to account for all possible token
combinations, I took the following primitive strings:

2
a
e
2a
2e
a2
e2
2a2
2e2
a2a
e2e
a2e
e2a
2a2a
2a2e
2e2a
2e2e
a2a2
a2e2
e2a2
e2e2
2a2a2
2a2e2
2e2a2
2e2e2
a2a2a
a2a2e
a2e2a
a2e2e
e2a2a
e2a2e
e2e2a
e2e2e

These should have captured every possibility regarding
number/ident/dimension parsing, including any scinot issues in
numbers/dimensions.
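
The list above appears to be every string of length 1 to 5 that alternates the digit "2" with a letter drawn from "a" and "e". A minimal sketch that regenerates it under that assumption (the function name `primitives` and the `max_len` parameter are illustrative, not from the original message):

```python
from itertools import product

def primitives(max_len=5):
    """Strings of length 1..max_len alternating "2" with a letter from "ae"."""
    out = []
    for n in range(1, max_len + 1):
        for starts_with_digit in (True, False):
            # Letters sit on odd positions if the string starts with a digit,
            # and on even positions otherwise.
            letter_slots = [i for i in range(n)
                            if (i % 2 == 0) != starts_with_digit]
            for letters in product("ae", repeat=len(letter_slots)):
                chars = ["2"] * n
                for slot, letter in zip(letter_slots, letters):
                    chars[slot] = letter
                out.append("".join(chars))
    return out
```

The letter "e" is included alongside "a" because it can be read as an exponent marker inside a number or dimension, which is the scinot case mentioned above.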

I then generated strings by running "u+{0}" on all of them, and again
with "u+{0}-{1}" (ranging over the cross-product of the list with
itself). I ran all of these through the tokenizer at
<https://github.com/tabatkins/parse-css>, which matches the spec,
found all the unique token combinations so produced, and made a
grammar from that.  Those patterns which were produced by the first
set (without the - character) got an optional "?" tacked onto their
end in the grammar, and I added an extra clause just for the u+????
form.
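
The generation and de-duplication steps described above can be sketched roughly as follows. This is an illustration only: the actual tokenizer is the JavaScript one at the URL above, so `tokenize` here is a hypothetical stand-in callable that returns a list of token type names for a string, and the function names are made up for the example.

```python
from itertools import product

# The primitive strings listed in the message.
PRIMITIVES = """2 a e 2a 2e a2 e2 2a2 2e2 a2a e2e a2e e2a
2a2a 2a2e 2e2a 2e2e a2a2 a2e2 e2a2 e2e2
2a2a2 2a2e2 2e2a2 2e2e2 a2a2a a2a2e a2e2a a2e2e
e2a2a e2a2e e2e2a e2e2e""".split()

def candidate_strings(prims):
    """Build the "u+{0}" strings and the "u+{0}-{1}" cross-product strings."""
    singles = ["u+{0}".format(p) for p in prims]
    ranges = ["u+{0}-{1}".format(a, b) for a, b in product(prims, repeat=2)]
    return singles, ranges

def unique_token_patterns(strings, tokenize):
    """Collect the distinct token-type sequences that `tokenize` produces.

    `tokenize` is an assumed callable: str -> list of token type names.
    """
    return {tuple(tokenize(s)) for s in strings}
```

With the 33 primitives this yields 33 single-value strings and 1089 range strings. The patterns from the first set are the ones that would get the optional "?" in the grammar.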

I believe this is an exhaustive cover of the syntax possibilities.

~TJ

Received on Monday, 17 November 2014 21:24:53 UTC