- From: Tab Atkins Jr. via GitHub <sysbot+gh@w3.org>
- Date: Fri, 01 Feb 2019 23:45:23 +0000
- To: public-css-archive@w3.org
tabatkins has just created a new issue for https://github.com/w3c/csswg-drafts:
== [css-syntax] Wrapping up the <unicode-range> thing ==
(migrated from the mailing list)
**Tab Atkins said:**
> So, unicode ranges aren't settled right now, and I'd like to wrap them up.
>
> Quick history lesson:
>
> 1. Unicode ranges were originally defined as a CSS token. They have
> to be specially handled, because they don't look like any other token.
>
> 2. FF got some bug reports about the selector `u+a {...}` failing,
> because it parses as a unicode-range token, which is invalid in
> selectors.
>
> 3. I proposed we eliminate unicode-range as a token, and break it down
> into a complicated reimagining based on existing tokens, like I did
> for An+B.
>
>
> The major problem with this is that some hex numbers look like
> exponented numbers. For example, "U+04e4" is supposed to be Ӥ, but it
> parses as:
>
> ident(U) delim(+) number(40000)
>
> Obviously, 0x4e4 and 40000 are very different numbers! (U+40000 is
> actually invalid!) I currently solve this by keeping around the
> "representation" of the number token, which is the actual characters
> it was written with, but no impl does that, or is willing to keep
> around a string for every number and dimension they parse.
>
> So I think there are two ways we can handle this:
>
> 1. Abandon the project, restore <unicode-range-token>, and live with
> the fact that we have a weird almost-useless token that will
> occasionally cause problems for authors in unrelated contexts. (We
> can't even really do something like make Selectors treat unicode-range
> specially, because it can cut selectors in pieces - "u+area" parses as
> a urange(a) ident(rea)!)
>
> 2. Produce a new, reliable syntax for unicode ranges, and keep around
> the old version for back-compat, with a warning that some values won't
> parse correctly. The most obvious fix is to just replace the + with a
> -, like "U-0404", "U-400-600", or "U-4??". This makes the entire
> thing an ident, which keeps around the characters properly (or an
> ident followed by some ? delims, which is also fine).
>
> Thoughts?
------------
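(Editorial illustration, not part of the original thread.) The ambiguity described above can be reproduced in a few lines of Python. This is only a sketch: the helper names `as_code_point` and `as_css_number` are hypothetical, and Python's `float()` stands in for a CSS number token's value, which it happens to match for this input.

```python
# Sketch: the characters after "U+" read as hex versus as a number with an exponent.
# Both helper names are hypothetical; float() only approximates CSS number parsing.

def as_code_point(digits: str) -> int:
    """Read the digits the way unicode-range intends: as hexadecimal."""
    return int(digits, 16)

def as_css_number(digits: str) -> float:
    """Read the digits the way a number token's value would be computed."""
    return float(digits)

digits = "04e4"  # the part after "U+" in "U+04e4"
print(hex(as_code_point(digits)))  # 0x4e4
print(chr(as_code_point(digits)))  # Ӥ
print(as_css_number(digits))       # 40000.0
```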
**Simon Sapin said:**
> On 22/06/15 17:26, Tab Atkins Jr. wrote:
> > So I think there are two ways we can handle this:
> >
> > 1. Abandon the project, restore <unicode-range-token>, and live with
> > the fact that we have a weird almost-useless token that will
> > occasionally cause problems for authors in unrelated contexts. (We
> > can't even really do something like make Selectors treat unicode-range
> > specially, because it can cut selectors in pieces - "u+area" parses as
> > a urange(a) ident(rea)!)
>
> Not sure if this is a good idea, but we *could* handle that in the
> Selectors grammar as well. u+a/**/rea would also parse, which we might
> not want, but it’s much harder for authors to accidentally do that than u+a.
>
>
> > 2. Produce a new, reliable syntax for unicode ranges, and keep around
> > the old version for back-compat, with a warning that some values won't
> > parse correctly. The most obvious fix is to just replace the + with a
> > -, like "U-0404", "U-400-600", or "U-4??". This makes the entire
> > thing an ident, which keeps around the characters properly (or an
> > ident followed by some ? delims, which is also fine).
>
> `unicode-range: U+04e4` works today in multiple browsers. Breaking this
> seems worse than the u+a selector not working. (Introducing an
> alternative unicode-range syntax will not help existing unmaintained
> content.)
---------------
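(Editorial illustration, not part of the original thread.) A rough sketch of how a legacy unicode-range token can cut a selector in pieces. The regular expression follows the CSS 2.1-style token shape in simplified form, and `split_leading_urange` is a hypothetical helper, not any browser's tokenizer.

```python
import re

# Simplified CSS 2.1-style unicode-range token: "u+", then 1-6 hex digits or "?",
# then an optional "-" and 1-6 hex digits. An assumption of this sketch.
URANGE = re.compile(r"u\+[0-9a-f?]{1,6}(?:-[0-9a-f]{1,6})?", re.IGNORECASE)

def split_leading_urange(text: str):
    """Return (urange_text, rest) if text starts with a unicode-range token,
    otherwise (None, text)."""
    m = URANGE.match(text)
    if m is None:
        return None, text
    return m.group(0), text[m.end():]

print(split_leading_urange("u+a"))     # ('u+a', '')
print(split_leading_urange("u+area"))  # ('u+a', 'rea')
print(split_leading_urange("u+p"))     # (None, 'u+p')
```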
**fantasai said:**
> I agree with Simon. We should not break unicode-range syntax here.
>
> If it's possible to fix this by munging the Selectors grammar,
> that seems like the best option. I'd argue that we may want to
> allow implementations to use context-specific parsing rules as
> well, if they want to go that route instead, so the UA would be
> allowed to either accept or reject u+a/**/rea. (A full CSS parser
> might not want to do that, but a Selectors parser shouldn't have
> to deal with unicode-range token munging. Ditto An+B, now I think
> about it.)
---------------
**Simon Sapin said:**
> Allowing a different behavior without mandating it reduces interop, and
> this doesn’t seem to be a good enough reason to do it.
-------------
**fantasai said:**
> The cases where there wouldn't be interop are just weird edge cases
> like u+a/**/rea, right? I don't think interop on that case is worth
> imposing the complexity of a CSS-token-munging parsing model on all
> non-CSS implementations of Selectors.
-------------
**Tab Atkins said:**
> On Fri, Jun 26, 2015 at 3:37 PM, Simon Sapin <simon.sapin@exyr.org> wrote:
> > `unicode-range: U+04e4` works today in multiple browsers. Breaking this
> > seems worse than the u+a selector not working. (Introducing an alternative
> > unicode-range syntax will not help existing unmaintained content.)
>
> There's a difference between "it works" and "it's used". I'm going to
> run some searches over our corpus and see if I can find any actual
> uses of unicode-ranges that look like scientific-notation numbers.
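
(Editorial illustration, not part of the original thread.) One way such a search might be approximated, assuming stylesheets are available as plain strings. The function name `scinot_like_ranges` and both patterns are hypothetical, and the check only covers range endpoints that begin with digits, an `e`, and another digit.

```python
import re

# Simplified unicode-range token shape; capture the start and the optional end.
URANGE = re.compile(r"u\+([0-9a-f?]{1,6})(?:-([0-9a-f]{1,6}))?", re.IGNORECASE)
# Digits, then "e", then a digit: the prefix a number token would read as an exponent.
SCINOT_PREFIX = re.compile(r"[0-9]+e[0-9]", re.IGNORECASE)

def scinot_like_ranges(css: str) -> list:
    """Return unicode-range tokens with an endpoint that looks like scientific notation."""
    hits = []
    for m in URANGE.finditer(css):
        parts = [p for p in m.groups() if p is not None]
        if any(SCINOT_PREFIX.match(p) for p in parts):
            hits.append(m.group(0))
    return hits

sample = "unicode-range: U+04e4, U+0400-04FF, U+4??, U+1E00;"
print(scinot_like_ranges(sample))  # ['U+04e4', 'U+1E00']
```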
Please view or discuss this issue at https://github.com/w3c/csswg-drafts/issues/3591 using your GitHub account
Received on Friday, 1 February 2019 23:45:24 UTC