- From: Tab Atkins Jr. <jackalmage@gmail.com>
- Date: Tue, 12 Apr 2016 13:37:54 -0700
- To: www-style list <www-style@w3.org>
History: CSS2.1 defined a special grammar token just for unicode ranges, which was used in exactly one place: the 'unicode-range' descriptor of @font-face. This special production caused bugs in pages, where selectors like `u+a { ... }` were parsed as a UNICODE-RANGE token, rather than the expected "IDENT(u) DELIM(+) IDENT(a)", like every other selector of that form was parsed. (This isn't theoretical - Moz had a bug reported against it for this.)

When writing the Syntax spec, I tried to fix this by dropping the unicode-range concept from the tokenizer, and instead handling it as a complex construct of the existing tokens, like I did with <an+b>. This kinda worked initially, but was *really* nasty. Since then, we added scinot to numbers (like 1e3 for 1000), and this *completely destroyed* my ability to define <urange> cleanly - I can no longer use the value of numeric tokens, and instead have to rely on the "representation", which no browser stores or wants to store.

I want to go ahead and resolve this. I can see three options:

1. Keep what I'm currently doing. This requires browsers to hold onto the string representation of numeric tokens (numbers and dimensions) at least through initial parsing (longer if they're used in a custom property).

2. Abandon this effort, go back to having a special unicode-range token. Accept that this is weird and there are stupid side-effects, like some selectors not working.

3. Define a new <urange> syntax that's actually simple to obtain from the existing tokens¹. Deprecate the old syntax; require UAs to accept the old syntax in the 'unicode-range' descriptor, but don't define how they should do so. (Current UAs use context-sensitive retokenizing, I think - once they realize they're in a unicode-range descriptor, they'll retokenize the original text according to a special set of rules.)

Thoughts?

¹ Simplest change is just to replace the + with a -, so you write `U-2016` for ‖. This makes unicode ranges always a single IDENT token, plus possibly some trailing '?' DELIM tokens. You then have to parse the token's value to make sure it's a valid range, but that's way, way easier than the garbage fire I have to deal with from today's syntax.

~TJ
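To illustrate how little work the footnote's `U-` form would leave for the parser, here is a rough Python sketch that validates the value of the single IDENT token plus a count of trailing '?' DELIM tokens. Everything in it is an assumption made for illustration, not spec text: the function name `parse_urange`, the `U-start-end` spelling for explicit ranges, the 1-6 hex digit limit and the U+10FFFF ceiling are simply carried over from today's 'unicode-range' rules, and it requires at least one hex digit before any '?'.

```python
import re

# Hypothetical limits, mirroring the existing 'unicode-range' rules.
MAX_CODE_POINT = 0x10FFFF

# "u-" followed by 1-6 hex digits, optionally "-" and 1-6 more hex digits.
_IDENT_RE = re.compile(r"u-([0-9a-f]{1,6})(?:-([0-9a-f]{1,6}))?", re.IGNORECASE)


def parse_urange(ident_value, trailing_questions=0):
    """Return (start, end) for an IDENT value such as "U-2016".

    ident_value        -- the value of the single IDENT token
    trailing_questions -- number of '?' DELIM tokens that followed it
    Raises ValueError if the input is not a valid range under the
    assumptions stated above.
    """
    match = _IDENT_RE.fullmatch(ident_value)
    if match is None:
        raise ValueError(f"not a urange ident: {ident_value!r}")

    first, second = match.group(1), match.group(2)

    if second is not None:
        # Explicit range: wildcards are not allowed alongside it.
        if trailing_questions:
            raise ValueError("'?' cannot follow an explicit range")
        start, end = int(first, 16), int(second, 16)
    else:
        # Single code point or wildcard range; at most 6 positions total.
        if len(first) + trailing_questions > 6:
            raise ValueError("too many digits and wildcards")
        start = int(first + "0" * trailing_questions, 16)
        end = int(first + "F" * trailing_questions, 16)

    if end > MAX_CODE_POINT or start > end:
        raise ValueError("range is out of bounds or reversed")
    return start, end
```

For example, `parse_urange("U-2016")` gives the single code point range for ‖, and `parse_urange("U-4", 2)` covers the wildcard case where IDENT(U-4) is followed by two '?' delimiters. Nothing here needs the original text of a numeric token, which is the property the footnote is after.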
Received on Tuesday, 12 April 2016 20:38:40 UTC