[css-syntax] <urange> and its problems from Tab Atkins Jr. on 2016-04-12 (www-style@w3.org from April 2016)

From: Tab Atkins Jr. <jackalmage@gmail.com>
Date: Tue, 12 Apr 2016 13:37:54 -0700
To: www-style list <www-style@w3.org>
Message-ID: <CAAWBYDCnSwFVvxMpOGMCa=_DbrFzXe6zZ_PaRL+Jhk8nnwDKKA@mail.gmail.com>

History: CSS2.1 defined a special grammar token just for unicode
ranges, which was used in exactly one place: the 'unicode-range'
descriptor of @font-face.  This special production caused bugs in
pages, where selectors like `u+a { ... }` were parsed as a
UNICODE-RANGE token, rather than the expected "IDENT(u) DELIM(+)
IDENT(a)" that every other selector of that form parses to.  (This
isn't theoretical - Moz had a bug reported against it for this.)
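
To make the collision concrete, here is a minimal sketch in Python.  The
pattern is the UNICODE-RANGE rule as I recall it from the CSS2.1 lexical
grammar; the helper `first_token_kind` is hypothetical and only reports
whether that rule would claim the start of the text.  It isn't how any
real tokenizer is structured.

```python
import re

# UNICODE-RANGE roughly as written in the CSS2.1 lexical grammar
# (matching is case-insensitive). Assumed from memory of that grammar.
UNICODE_RANGE = re.compile(r"u\+[0-9a-f?]{1,6}(-[0-9a-f]{1,6})?", re.IGNORECASE)


def first_token_kind(text: str) -> str:
    """Hypothetical helper: say which rule claims the start of `text`."""
    m = UNICODE_RANGE.match(text)
    if m:
        return f"UNICODE-RANGE({m.group(0)})"
    return "IDENT/DELIM sequence"


print(first_token_kind("u+a { color: red }"))  # claimed by UNICODE-RANGE
print(first_token_kind("p+a { color: red }"))  # ordinary selector tokens
```

Only selectors whose left-hand element is `u` and whose right-hand side
starts with a hex digit or `?` get swallowed this way, which is why the
bug shows up so rarely.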

When writing the Syntax spec, I tried to fix this by dropping the
unicode-range concept from the tokenizer, and instead handling it as a
complex construct of the existing tokens, like I did with <an+b>.
This kinda worked initially, but was *really* nasty.  Since then, we
added scientific notation to numbers (like 1e3 for 1000), and this
*completely destroyed* my ability to define <urange> cleanly - I can no
longer use
the value of numeric tokens, and instead have to rely on the
"representation", which no browser stores or wants to store.
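
A small Python sketch of why the numeric value stops being enough.  It
assumes `U+1e3` and `U+1000` each tokenize as IDENT(U) followed by a
number token; `numeric_value` is a hypothetical stand-in for whatever
conversion a tokenizer applies, not a real browser API.

```python
def numeric_value(representation: str) -> float:
    """Hypothetical stand-in for a tokenizer's number conversion."""
    return float(representation)


# Number tokens that would follow IDENT(U) in `U+1e3` and `U+1000`.
a = numeric_value("+1e3")   # author meant code point U+01E3
b = numeric_value("+1000")  # author meant code point U+1000

print(a == b)  # the stored values are indistinguishable

# Reading the original text as hex digits still tells them apart.
print(hex(int("1e3", 16)), hex(int("1000", 16)))
```

Once only the value is kept, the two ranges can't be told apart, so the
original text of the token has to survive somewhere.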

I want to go ahead and resolve this.  I can see three options:

1. Keep what I'm currently doing.  This requires browsers to hold onto
the string representation of numeric tokens (numbers and dimensions)
at least through initial parsing (longer if they're used in a custom
property).

2. Abandon this effort, go back to having a special unicode-range
token. Accept that this is weird and there are stupid side-effects,
like some selectors not working.

3. Define a new <urange> syntax that's actually simple to obtain from
the existing tokens¹. Deprecate the old syntax; require UAs to accept
the old syntax in the 'unicode-range' descriptor, but don't define how
they should do so.  (Current UAs use context-sensitive retokenizing, I
think - once they realize they're in a unicode-range descriptor,
they'll retokenize the original text according to a special set of
rules.)

Thoughts?

¹ Simplest change is just to replace the + with a -, so you write
`U-2016` for ‖. This makes unicode ranges always a single IDENT token,
plus possibly some trailing '?' DELIM tokens.  You then have to parse
the token's value to make sure it's a valid range, but that's way, way
easier than the garbage fire I have to deal with from today's syntax.
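
As a rough illustration of how little work that would be, here is a
Python sketch.  The function `parse_urange`, its parameters, and the
exact limits (at most 6 hex digits, optional `-end` part, wildcards not
combined with an explicit end) are assumptions carried over from the
existing syntax; nothing above pins them down.

```python
import re

MAX_CODE_POINT = 0x10FFFF

# Assumed shape of the IDENT value: "U-" + hex digits, optional "-" + hex end.
_URANGE_IDENT = re.compile(r"[uU]-([0-9a-fA-F]{0,6})(?:-([0-9a-fA-F]{1,6}))?")


def parse_urange(ident: str, question_marks: int = 0):
    """Return (start, end) for a proposed-style range, or None if invalid.

    `ident` is the value of the IDENT token (e.g. "U-2016") and
    `question_marks` is the number of '?' DELIM tokens that followed it.
    Both parameters are hypothetical.
    """
    m = _URANGE_IDENT.fullmatch(ident)
    if m is None:
        return None
    first, second = m.group(1), m.group(2)
    if question_marks:
        # Wildcard form: no explicit end, at most 6 characters in total.
        if second is not None or len(first) + question_marks > 6:
            return None
        start = int(first + "0" * question_marks, 16)
        end = int(first + "F" * question_marks, 16)
    else:
        if not first:
            return None
        start = int(first, 16)
        end = int(second, 16) if second is not None else start
    if start > end or end > MAX_CODE_POINT:
        return None
    return start, end


print(parse_urange("U-2016"))       # single code point
print(parse_urange("U-4", 2))       # U-4?? as IDENT(U-4) plus two '?'
print(parse_urange("U-0025-00FF"))  # explicit range in one IDENT
```

The whole check is one pattern match on a string the tokenizer already
has, with no need to look at the representation of any numeric token.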

~TJ

Received on Tuesday, 12 April 2016 20:38:40 UTC