[css-syntax][css-fonts] Wrapping up the <unicode-range> thing from Tab Atkins Jr. on 2015-06-23 (www-style@w3.org from June 2015)

From: Tab Atkins Jr. <jackalmage@gmail.com>
Date: Mon, 22 Jun 2015 17:26:08 -0700
To: www-style list <www-style@w3.org>
Message-ID: <CAAWBYDDkC9QiBAyUc5vGtxjpYwowHxjWv__m-Ev9LrVgag_knw@mail.gmail.com>

So, unicode ranges aren't settled right now, and I'd like to wrap them up.

Quick history lesson:

1. Unicode ranges were originally defined as a CSS token.  They have
to be specially handled, because they don't look like any other token.

2. FF got some bug reports about the selector `u+a {...}` failing.
The reason is that `u+a` parses as a unicode-range token, which is
invalid in selectors.

3. I proposed we eliminate unicode-range as a token, and break it down
into a complicated reimagining based on existing tokens, like I did
for An+B.
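
To make item 2 concrete, here is a rough Python sketch of how a
unicode-range token can swallow the start of a selector.  The regex is
an assumption modelled on the CSS 2.1 tokenizer grammar, and
`leading_urange` is a hypothetical helper name; neither is taken from
any engine's actual code.

```python
import re

# Approximation of the legacy unicode-range token pattern (an assumption
# modelled on the CSS 2.1 grammar, not any browser's real tokenizer).
URANGE = re.compile(r"u\+[0-9a-f?]{1,6}(-[0-9a-f]{1,6})?", re.IGNORECASE)

def leading_urange(css):
    """Return (urange text, remaining text), or None if none starts here."""
    m = URANGE.match(css)
    if m is None:
        return None
    return m.group(0), css[m.end():]

print(leading_urange("u+a {...}"))  # ('u+a', ' {...}')
print(leading_urange("u+area"))     # ('u+a', 'rea')
print(leading_urange("u+p"))        # None, since "p" is not a hex digit
```

Whether a selector breaks depends only on whether the element name
after the + happens to start with a hex digit or "?".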


The major problem with this is that some hex numbers look like
numbers in scientific notation.  For example, "U+04e4" is supposed to
be Ӥ, but it parses as:

ident(U) number(+40000)

Obviously, 0x4e4 and 40000 are very different numbers!  (U+40000 isn't
even an assigned code point!)  I currently solve this by keeping around
the "representation" of the number token, which is the actual
characters it was written with, but no impl does that or is willing to
keep around a string for every number and dimension they parse.
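
The mismatch is easy to reproduce.  This is only an illustration in
Python, using `float()` as a stand-in for a CSS number parser (an
assumption; the two are not identical in general, but they agree on
this input):

```python
text = "04e4"  # the characters written after "U+"

# Read as a number with an exponent: 04 * 10**4.
as_number = float(text)

# Read as the hex code point the author intended.
as_hex = int(text, 16)

print(as_number)    # 40000.0
print(as_hex)       # 1252, i.e. 0x4e4
print(chr(as_hex))  # Ӥ

# Once only the numeric value is stored, the original spelling is gone:
# "04e4" and "40000" are indistinguishable.
print(float("04e4") == float("40000"))  # True
```

The last line is why the value alone is not enough: without the
original characters, there is no way to recover 0x4e4 from 40000.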

So I think there are two ways we can handle this:

1. Abandon the project, restore <unicode-range-token>, and live with
the fact that we have a weird almost-useless token that will
occasionally cause problems for authors in unrelated contexts.  (We
can't even really do something like make Selectors treat unicode-range
specially, because it can cut selectors in pieces - "u+area" parses as
urange(a) ident(rea)!)

2. Produce a new, reliable syntax for unicode ranges, and keep around
the old version for back-compat, with a warning that some values won't
parse correctly.  The most obvious fix is to just replace the + with a
-, like "U-0404", "U-400-600", or "U-4??".  This makes the entire
thing an ident, which keeps around the characters properly (or an
ident followed by some ? delims, which is also fine).
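
As a rough sketch of how option 2 could be consumed, assuming the
ident and any trailing ? delims have already been joined back into one
string: `parse_proposed_urange` is a hypothetical name, the "U-" form
is only the proposal above, and the ? wildcard expansion follows the
existing unicode-range convention (4?? means 400 to 4FF).  It does not
check the upper bound of 10FFFF or that start <= end.

```python
import re

def parse_proposed_urange(text):
    """Parse 'U-0404', 'U-400-600' or 'U-4??' into (start, end) code points."""
    body = text.lower()
    if not body.startswith("u-"):
        raise ValueError("not a U- range: %r" % text)
    body = body[2:]

    if "?" in body:
        # Wildcard form: optional hex digits, then only ?s, 6 chars at most.
        if len(body) > 6 or not re.fullmatch(r"[0-9a-f]*\?+", body):
            raise ValueError("bad wildcard range: %r" % text)
        return int(body.replace("?", "0"), 16), int(body.replace("?", "f"), 16)

    # Single code point, or start-end interval.
    m = re.fullmatch(r"([0-9a-f]{1,6})(?:-([0-9a-f]{1,6}))?", body)
    if m is None:
        raise ValueError("bad range: %r" % text)
    start = int(m.group(1), 16)
    end = int(m.group(2), 16) if m.group(2) else start
    return start, end

print(parse_proposed_urange("U-0404"))     # (1028, 1028)
print(parse_proposed_urange("U-400-600"))  # (1024, 1536)
print(parse_proposed_urange("U-4??"))      # (1024, 1279)
print(parse_proposed_urange("U-04e4"))     # (1252, 1252)
```

Because the digits arrive as text rather than as a parsed number,
"U-04e4" comes out as 0x4e4 instead of 40000.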

Thoughts?

~TJ

Received on Tuesday, 23 June 2015 00:26:56 UTC