[csswg-drafts] [css-syntax] Wrapping up the <unicode-range> thing (#3591) from Tab Atkins Jr. via GitHub on 2019-02-01 (public-css-archive@w3.org from February 2019)

From: Tab Atkins Jr. via GitHub <sysbot+gh@w3.org>
Date: Fri, 01 Feb 2019 23:45:23 +0000
To: public-css-archive@w3.org
Message-ID: <issues.opened-405916235-1549064722-sysbot+gh@w3.org>
tabatkins has just created a new issue for https://github.com/w3c/csswg-drafts:

== [css-syntax] Wrapping up the <unicode-range> thing ==
(migrated from the mailing list)

**Tab Atkins said:**

> So, unicode ranges aren't settled right now, and I'd like to wrap them up.
> 
> Quick history lesson:
> 
> 1. Unicode ranges were originally defined as a CSS token.  They have
> to be specially handled, because they don't look like any other token.
> 
> 2. FF got some bug reports about the selector `u+a {...}` failing -
> the reason is that it parses as a unicode-range token, which is
> invalid for selectors.
> 
> 3. I proposed we eliminate unicode-range as a token, and break it down
> into a complicated reimagining based on existing tokens, like I did
> for An+B.
> 
> 
> The major problem with this is that some hex numbers look like
> exponented numbers.  For example, "U+04e4" is supposed to be Ӥ, but it
> parses as:
> 
> ident(U) delim(+) number(40000)
> 
> Obviously, 0x4e4 and 40000 are very different numbers!  (U+40000 is
> actually invalid!)  I currently solve this by keeping around the
> "representation" of the number token, which is the actual characters
> it was written with, but no impl does that, or is willing to keep
> around a string for every number and dimension they parse.
> 
> So I think there are two ways we can handle this:
> 
> 1. Abandon the project, restore <unicode-range-token>, and live with
> the fact that we have a weird almost-useless token that will
> occasionally cause problems for authors in unrelated contexts.  (We
> can't even really do something like make Selectors treat unicode-range
> specially, because it can cut selectors in pieces - "u+area" parses as
> a urange(a) ident(rea)!)
> 
> 2. Produce a new, reliable syntax for unicode ranges, and keep around
> the old version for back-compat, with a warning that some values won't
> parse correctly.  The most obvious fix is to just replace the + with a
> -, like "U-0404", "U-400-600", or "U-4??".  This makes the entire
> thing an ident, which keeps around the characters properly (or an
> ident followed by some ? delims, which is also fine).
> 
> Thoughts?
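
The number clash described in the message can be reproduced with a minimal Python sketch. It is only an illustration, not how any CSS tokenizer is implemented: `float()` stands in for a generic number parser, and `parse_dash_range` is a hypothetical helper for the `U-` spelling floated as option 2, with no validation of length or of the maximum code point.

```python
# The characters after "U+" in "U+04e4", read two different ways.
text = "04e4"

as_hex = int(text, 16)    # the unicode-range meaning: 0x4e4
as_number = float(text)   # a generic number parser: 4 * 10**4

print(hex(as_hex), chr(as_hex))  # 0x4e4 Ӥ
print(as_number)                 # 40000.0

# Once only the numeric value is kept, the original spelling is lost:
# these three strings are indistinguishable afterwards.
assert float("04e4") == float("4e4") == float("40000")


def parse_dash_range(value):
    """Hypothetical parser for the proposed 'U-...' form.

    Returns (start, end) code points. Because the whole value is an
    identifier-like string, its characters are still available here.
    """
    body = value[2:]  # drop the leading "U-"
    if "-" in body:
        low, high = body.split("-")
        return int(low, 16), int(high, 16)
    # '?' is a wildcard digit: lowest value 0, highest value F.
    return int(body.replace("?", "0"), 16), int(body.replace("?", "F"), 16)


print(parse_dash_range("U-0404"))     # (1028, 1028)
print(parse_dash_range("U-400-600"))  # (1024, 1536)
print(parse_dash_range("U-4??"))      # (1024, 1279)
```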

------------

**Simon Sapin said:**

> On 22/06/15 17:26, Tab Atkins Jr. wrote:
> > So I think there are two ways we can handle this:
> >
> > 1. Abandon the project, restore <unicode-range-token>, and live with
> > the fact that we have a weird almost-useless token that will
> > occasionally cause problems for authors in unrelated contexts.  (We
> > can't even really do something like make Selectors treat unicode-range
> > specially, because it can cut selectors in pieces - "u+area" parses as
> > a urange(a) ident(rea)!)
> 
> Not sure if this is a good idea, but we *could* handle that in the 
> Selectors grammar as well. u+a/**/rea would also parse, which we might 
> not want, but it’s much harder for authors to accidentally do that than u+a.
> 
> 
> > 2. Produce a new, reliable syntax for unicode ranges, and keep around
> > the old version for back-compat, with a warning that some values won't
> > parse correctly.  The most obvious fix is to just replace the + with a
> > -, like "U-0404", "U-400-600", or "U-4??".  This makes the entire
> > thing an ident, which keeps around the characters properly (or an
> > ident followed by some ? delims, which is also fine).
> 
> `unicode-range: U+04e4` works today in multiple browsers. Breaking this 
> seems worse than the u+a selector not working. (Introducing an 
> alternative unicode-range syntax will not help existing unmaintained 
> content.)
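
The way a dedicated unicode-range token cuts `u+area` in pieces can be sketched as follows. This is a simplified model under stated assumptions: the `URANGE` pattern covers only `u+` followed by one to six hex digits or `?` and leaves out the `start-end` form, and `split_legacy` is a hypothetical helper, not part of any real parser.

```python
import re

# Assumption: simplified legacy token, "u+" then 1-6 hex digits or "?".
URANGE = re.compile(r"[uU]\+[0-9a-fA-F?]{1,6}")


def split_legacy(text):
    """Return (urange_token, remainder), or (None, text) if no token starts here."""
    match = URANGE.match(text)
    if match is None:
        return None, text
    return match.group(0), text[match.end():]


print(split_legacy("u+area"))    # ('u+a', 'rea')
print(split_legacy("u+a {...}")) # ('u+a', ' {...}')
print(split_legacy("U+04e4"))    # ('U+04e4', '')
print(split_legacy("u+p"))       # (None, 'u+p')
```

Only selectors whose second element name starts with a hex digit (`a` through `f`) are affected, which is why `u+p` passes through untouched in this sketch.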

---------------

**fantasai said:**

> I agree with Simon. We should not break unicode-range syntax here.
> 
> If it's possible to fix this by munging the Selectors grammar,
> that seems like the best option. I'd argue that we may want to
> allow implementations to use context-specific parsing rules as
> well, if they want to go that route instead, so the UA would be
> allowed to either accept or reject u+a/**/rea. (A full CSS parser
> might not want to do that, but a Selectors parser shouldn't have
> to deal with unicode-range token munging. Ditto An+B, now I think
> about it.)

---------------

**Simon Sapin said:**

> Allowing a different behavior without mandating it reduces interop, and 
> this doesn’t seem to be a good enough reason to do it.

-------------

**fantasai said:**

> The cases where there wouldn't be interop are just weird edge cases
> like u+a/**/rea, right? I don't think interop on that case is worth
> imposing the complexity of a CSS-token-munging parsing model on all
> non-CSS implementations of Selectors.

-------------

**Tab Atkins said:**

> On Fri, Jun 26, 2015 at 3:37 PM, Simon Sapin <simon.sapin@exyr.org> wrote:
> > `unicode-range: U+04e4` works today in multiple browsers. Breaking this
> > seems worse than the u+a selector not working. (Introducing an alternative
> > unicode-range syntax will not help existing unmaintained content.)
> 
> There's a difference between "it works" and "it's used". I'm going to
> run some searches over our corpus and see if I can find any actual
> uses of unicode-ranges that look like scinot numbers.
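
A corpus search of this kind could look roughly like the sketch below. The pattern and the `find_scinot_uranges` helper are assumptions made for illustration, not the search that was actually run. The sketch flags only single values of the form digits, `e`, digits, and it deliberately skips values with trailing hex letters (such as `U+1e3f`) and range endpoints.

```python
import re

# Hypothetical pattern: "U+" followed by digits, one "e", then digits,
# which a generic number tokenizer would read as scientific notation.
SCINOT_URANGE = re.compile(r"[uU]\+(\d+[eE]\d+)(?![0-9a-fA-F?])")


def find_scinot_uranges(css):
    """Return every unicode-range value in css that looks like a scinot number."""
    return [match.group(0) for match in SCINOT_URANGE.finditer(css)]


sample = "unicode-range: U+04e4, U+0400-04FF, U+1e3, U+4??;"
print(find_scinot_uranges(sample))  # ['U+04e4', 'U+1e3']
```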

Please view or discuss this issue at https://github.com/w3c/csswg-drafts/issues/3591 using your GitHub account
Received on Friday, 1 February 2019 23:45:24 UTC