[csswg-drafts] [css-syntax] Urange and its problems (#3588) from Tab Atkins Jr. via GitHub on 2019-02-01 (public-css-archive@w3.org from February 2019)

From: Tab Atkins Jr. via GitHub <sysbot+gh@w3.org>
Date: Fri, 01 Feb 2019 22:17:55 +0000
To: public-css-archive@w3.org
Message-ID: <issues.opened-405895544-1549059474-sysbot+gh@w3.org>

tabatkins has just created a new issue for https://github.com/w3c/csswg-drafts:

== [css-syntax] Urange and its problems ==
(migrated from the mailing list, for easier tracking here)

**Tab Atkins said:**
> History: CSS2.1 defined a special grammar token just for unicode
> ranges, which was used in exactly one place: the 'unicode-range'
> descriptor of @font-face.  This special production caused bugs in
> pages, where selectors like `u+a { ... }` were parsed as a
> UNICODE-RANGE token, rather than the expected "IDENT(u) DELIM(+)
> IDENT(a)", like every other selector of that form was parsed.  (This
> isn't theoretical - Moz had a bug reported against it for this.)
> 
> When writing the Syntax spec, I tried to fix this by dropping the
> unicode-range concept from the tokenizer, and instead handling it as a
> complex construct of the existing tokens, like I did with <an+b>.
> This kinda worked initially, but was *really* nasty.  Since then, we
> added scinot to numbers (like 1e3 for 1000), and this *completely
> destroyed* my ability to define <urange> cleanly - I can no longer use
> the value of numeric tokens, and instead have to rely on the
> "representation", which no browser stores or wants to store.
> 
> I want to go ahead and resolve this.  I can see three options:
> 
> 1. Keep what I'm currently doing.  This requires browsers to hold onto
> the string representation of numeric tokens (numbers and dimensions)
> at least through initial parsing (longer if they're used in a custom
> property).
> 
> 2. Abandon this effort, go back to having a special unicode-range
> token. Accept that this is weird and there are stupid side-effects,
> like some selectors not working.
> 
> 3. Define a new <urange> syntax that's actually simple to obtain from
> the existing tokens¹. Deprecate the old syntax; require UAs to accept
> the old syntax in the 'unicode-range' descriptor, but don't define how
> they should do so.  (Current UAs use context-sensitive retokenizing, I
> think - once they realize they're in a unicode-range descriptor,
> they'll retokenize the original text according to a special set of
> rules.)
> 
> Thoughts?
> 
> ¹ Simplest change is just to replace the + with a -, so you write
> `U-2016` for ‖. This makes unicode ranges always a single IDENT token,
> plus possibly some trailing '?' DELIM tokens.  You then have to parse
> the token's value to make sure it's a valid range, but that's way, way
> easier than the garbage fire I have to deal with from today's syntax.
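
To illustrate why the footnote's `U-` form would be simple to handle, here is a minimal sketch in Python. It assumes the tokenizer has already produced one IDENT value plus a count of trailing `?` DELIM tokens. The function name `parse_urange` and its signature are hypothetical, and the validity limits (at most six hex digits, end not above 10FFFF, start not above end) are borrowed from the existing `unicode-range` rules rather than from anything stated in this thread.

```python
import re

MAX_CODE_POINT = 0x10FFFF
_HEX = r"[0-9a-fA-F]"


def parse_urange(ident: str, question_marks: int = 0):
    """Return (start, end) for the hypothetical `U-` syntax, or None if invalid.

    `ident` is the value of the single IDENT token (e.g. "U-2016");
    `question_marks` is how many '?' DELIM tokens followed it.
    """
    if ident[:2].lower() != "u-":
        return None
    body = ident[2:]

    if question_marks:
        # Wildcard form: a hex prefix plus '?' characters, six at most in
        # total, and no explicit "start-end" range.
        if not re.fullmatch(_HEX + "{0,5}", body):
            return None
        if len(body) + question_marks > 6:
            return None
        start = int(body + "0" * question_marks, 16)
        end = int(body + "F" * question_marks, 16)
    else:
        # Single code point ("U-2016") or explicit range ("U-0025-00FF").
        m = re.fullmatch(f"({_HEX}{{1,6}})(?:-({_HEX}{{1,6}}))?", body)
        if m is None:
            return None
        start = int(m.group(1), 16)
        end = int(m.group(2), 16) if m.group(2) else start

    if start > end or end > MAX_CODE_POINT:
        return None
    return (start, end)
```

For example, `parse_urange("U-2016")` gives the single code point 0x2016, and `parse_urange("U-4", 2)` (for `U-4??`) gives the range 0x400 to 0x4FF.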

-----------

**fantasai said:**
> Given unicode-range is already shipping
> <http://caniuse.com/#feat=font-unicode-range>
> I think #3 is a non-starter.
> 
> I would imagine that reparsing unicode-range tokens in order to make
> the selectors work would be easier than doing #1, no? Hanging onto
> unicode-range tokens would be a lot less memory than hanging onto
> numbers and dimensions, given they're used so rarely.

-------------

**Tab Atkins said:**
> On Tue, Apr 12, 2016 at 2:27 PM, fantasai <fantasai.lists@inkedblade.net> wrote:
> > Given unicode-range is already shipping
> >   http://caniuse.com/#feat=font-unicode-range
> > I think #3 is a non-starter.
> 
> You might have misread - #3 is explicitly backwards-compatible. It
> requires UAs to support the old syntax, it just doesn't describe how
> they would do so.
> 
> > I would imagine that reparsing unicode-range tokens in order to make
> > the selectors work would be easier than doing #1, no? Hanging onto
> > unicode-range tokens would be a lot less memory than hanging onto
> > numbers and dimensions, given they're used so rarely.
> 
> Yeah, it just means we have to reparse them everywhere *except* unicode-range.

-----------------------

**Florian Rivoal said:**
> > On Apr 13, 2016, at 07:09, Tab Atkins Jr. <jackalmage@gmail.com> wrote:
> > 
> > On Tue, Apr 12, 2016 at 2:27 PM, fantasai <fantasai.lists@inkedblade.net> wrote:
> >> Given unicode-range is already shipping
> >>  http://caniuse.com/#feat=font-unicode-range
> >> I think #3 is a non-starter.
> > 
> > You might have misread - #3 is explicitly backwards-compatible. It
> > requires UAs to support the old syntax, it just doesn't describe how
> > they would do so.
> 
> As a UA implementor who has this on the roadmap, I don't like having a spec telling us to do something, without telling us how. All UAs would probably do fine at supporting the old syntax when it is correctly used, but I am much less confident that we'd all pick the same logic for error handling, and it is important that we all react the same way in the face of unknown/incorrect syntax.
> 
> >> I would imagine that reparsing unicode-range tokens in order to make
> >> the selectors work would be easier than doing #1, no? Hanging onto
> >> unicode-range tokens would be a lot less memory than hanging onto
> >> numbers and dimensions, given they're used so rarely.
> > 
> > Yeah, it just means we have to reparse them everywhere *except* unicode-range.
> 
> Right, this feels ugly and error-prone.

----------------

**Florian Rivoal said:**
> > On Apr 13, 2016, at 05:37, Tab Atkins Jr. <jackalmage@gmail.com> wrote:
> > 
> > 1. Keep what I'm currently doing.  This requires browsers to hold onto
> > the string representation of numeric tokens (numbers and dimensions)
> > at least through initial parsing (longer if they're used in a custom
> > property).
> 
> Does it really require that? Wouldn't it be good enough to hold onto the string representation of numeric tokens only when scinot is used? Given that scinot is pretty rare (and will stay that way), the memory requirement should be lower than storing the string representation of all numeric tokens.
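
A rough sketch of that idea in Python, under stated assumptions: `NumberToken`, `make_number_token`, and `hex_digits` are hypothetical names, and the token keeps its source text only when that text uses scientific notation. It does not try to settle whether scinot is the only case where the value loses information; leading zeros, for instance, are also dropped from the value.

```python
import re
from dataclasses import dataclass
from typing import Optional

# A CSS-style number that carries an exponent part, e.g. "+1e3" or ".5E-2".
_SCINOT = re.compile(r"[+-]?(?:\d+(?:\.\d+)?|\.\d+)[eE][+-]?\d+")


@dataclass
class NumberToken:
    value: float
    # Source text, kept only when it cannot be rebuilt from `value`.
    representation: Optional[str] = None


def make_number_token(text: str) -> NumberToken:
    """Build a token, storing the source text only for scinot numbers."""
    keep = _SCINOT.fullmatch(text) is not None
    return NumberToken(float(text), text if keep else None)


def hex_digits(token: NumberToken) -> str:
    """Recover the digits a urange parser would read as hexadecimal.

    Assumes a non-negative integer-valued token; leading zeros in the
    source are not recoverable when the representation was dropped.
    """
    if token.representation is not None:
        return token.representation.lstrip("+")
    return str(int(token.value))
```

With this, the `+1e3` in `U+1e3` has the numeric value 1000, but `hex_digits` still returns `1e3`, which a urange parser can read as code point 0x1E3. A plain `+2016` stores no string at all.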

----------------

**Simon Sapin said:**
> How about this?
> 
> 4. Same as 2, but tweak the Selector grammar to interpret unicode-range
> tokens that don’t have question marks as: a type selector "u", followed
> by a next-sibling combinator, followed by another type selector.
> 
> It’s weird, but it seems less messy to me than the alternatives.
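
One way to picture option 4 is as a rewrite of the unicode-range token into the three tokens a selector parser expects. The Python sketch below is only an illustration: `urange_as_selector` is a hypothetical helper, it assumes the token's source text is still available, and the token shape it matches is an approximation of the CSS2.1 unicode-range production.

```python
import re
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (token type, value)

# Approximate shape of a unicode-range token: "u+" followed by up to six
# hex digits or '?', optionally followed by "-" and an end value.
_URANGE = re.compile(r"([uU])\+([0-9a-fA-F?]{1,6}(?:-[0-9a-fA-F]{1,6})?)")


def urange_as_selector(text: str) -> Optional[List[Token]]:
    """Reinterpret a unicode-range token as `u + <type selector>`.

    Returns None when the token cannot be read as a selector.
    """
    m = _URANGE.fullmatch(text)
    if m is None or "?" in m.group(2):
        return None  # tokens with question marks are left alone
    name = m.group(2)
    if name[0].isdigit():
        return None  # an identifier cannot start with a digit
    return [("ident", m.group(1)), ("delim", "+"), ("ident", name)]
```

Here `urange_as_selector("u+a")` yields `IDENT(u) DELIM(+) IDENT(a)`, while `u+4??` and `u+1a` are rejected and would stay unicode-range tokens.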

-------------------

**Tab Atkins said:**
> Yeah. It really fucks up the grammar something *fierce*, so I think
> I'd have to do it as a preprocessing step before matching the actual
> Selectors grammar.  And anything else that ever wants to use a + is
> similarly affected; we seem to have settled on requiring spaces around
> math + and I don't expect us to use + for anything else, but custom
> properties would be stuck with this gotcha. :/

Please view or discuss this issue at https://github.com/w3c/csswg-drafts/issues/3588 using your GitHub account
Received on Friday, 1 February 2019 22:17:57 UTC