[css-syntax] Dropping <number-token> representation, and its effects on <urange> from Tab Atkins Jr. on 2014-11-19 (www-style@w3.org from November 2014)

From: Tab Atkins Jr. <jackalmage@gmail.com>
Date: Wed, 19 Nov 2014 12:28:24 -0800
To: www-style list <www-style@w3.org>
Message-ID: <CAAWBYDDX9N+dM_MAvA+KEtuUdvrHMkT=R2GVOZ=MuuOs_FG6Mw@mail.gmail.com>
In the telcon today, dbaron expressed concern that the definition of
<urange> requires looking at the "representation" of <number-token>s
and <dimension-token>s.  (The "representation" of a numeric token is
the actual text used to write the number, including leading 0s,
leading + sign, original base and exponent when using scientific
notation, etc.)

I pointed out that storing the representation of numeric tokens is
already required, in order to implement the <quirky-color> production
from the Quirks Mode spec
<https://quirks.spec.whatwg.org/#the-hashless-hex-color-quirk>.  IE's
behavior distinguishes between "color: 123;" and "color: 000123;", but
FF/WK/Blink don't; both are treated as #000123, so we can maybe change
the Quirks Mode spec to not require the representation.

So, that leaves us with three possible resolutions to the <urange> thing.

1. Leave it as it is.  This requires storing the representation on
every numeric token, which is a memory cost, but it lets us parse
<urange> precisely.  (The cost might not be as bad as all that.  If
you only store the representation when it's "non-obvious" (leading +
sign, leading 0s, or scientific notation), then the memory cost is
*most* of the time just a single null pointer per numeric token.  The
representation of an "obvious" form can be regenerated on the fly from
the value, so a helper function can make representation-retrieval easy
when it's necessary.)

2. Drop the representation requirement, and rejigger the <urange>
definition to account for that.  This has a few side effects:
    1) We can no longer limit the urange syntax to at most 6 hex
digits per component; arbitrary numbers of leading 0s will be allowed
and are impossible to detect.  This just means that U+0000000 becomes
valid, for example.
    2) Four of the six grammar clauses "eat" the plus sign in the
following numeric token, and it's not detectable from the value that a
plus sign was ever used.  The fact that whitespace is disallowed makes
this not a huge deal; in order to still hit the right token patterns,
you need to do some stupid comment tricks.  "U/**/0001" will
technically become valid, and equivalent to "U+0001".
    3) Scinot is still a problem.  "200", "200e0", "20e1", and "2e2"
all produce the same value when parsed as a <number-token>, but
obviously refer to four different codepoints when interpreted as hex
values.  Numeric tokens would have to record if they were in scinot
form, and what the exponent was.

3. Revert this whole thing, and restore <unicode-range-token>.  This
requires us to fix the original problem some other way.  As a
refresher, the original issue was that "u+a { ... }" is a syntax
error, as the selector is a <unicode-range-token>, not <ident-token>,
+, <ident-token> like the author meant.  Handling this in Selectors
requires us to essentially "retokenize" selectors, to turn *some*
<unicode-range-token>s into the expected token patterns; this would
have to be repeated for any other syntax that ends up allowing
something that looks like a unicode-range.  It also means that non-CSS
implementations of Selectors have to do some silly back-and-forth
where they tokenize some strings into (meaningless) unicode-range
tokens and then immediately re-tokenize them back into useful stuff.
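
To illustrate the original problem behind option 3, here is a small
Python sketch.  The regular expression is modelled on the CSS 2.1
UNICODE-RANGE token and is only an approximation of what a real
tokenizer does; the names UNICODE_RANGE and leading_unicode_range are
hypothetical.

```python
import re

# Approximation of the CSS 2.1 UNICODE-RANGE token (an assumption for
# illustration, not the css-syntax tokenizer algorithm).
UNICODE_RANGE = re.compile(r"u\+[0-9a-f?]{1,6}(-[0-9a-f]{1,6})?", re.IGNORECASE)


def leading_unicode_range(source):
    """Return the text a unicode-range token would claim at the start
    of `source`, or None if no such token starts there."""
    match = UNICODE_RANGE.match(source)
    return match.group(0) if match else None


# The selector "u+a" (a <u> element followed by a sibling <a>) is
# swallowed whole, because "a" is a hex digit.
print(leading_unicode_range("u+a { color: red }"))
# "p" is not a hex digit, so "u+p" is not affected.
print(leading_unicode_range("u+p { color: red }"))
```

The first call reports "u+a" and the second reports None, which is why
only some selectors are affected and why a blanket fix in the
tokenizer is awkward.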



I prefer solution #1 - doing it well increases the memory footprint of
a numeric token by the size of a pointer (generally doubling the size
of a <number-token>, but increasing the size of a <dimension-token> by
somewhat less), and allows us to handle <urange> exactly, without a
bunch of crazy hacks.
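
A minimal Python sketch of the "store it only when non-obvious" idea.
The names NumberToken, make_number_token, and representation_of are
hypothetical, and using Python's float() as the number parser is a
simplifying assumption, not the css-syntax algorithm.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NumberToken:
    value: float
    is_integer: bool
    # None means the source text was the "obvious" form, which can be
    # regenerated from the value; this is the single null pointer.
    representation: Optional[str] = None


def obvious_form(value, is_integer):
    """Canonical text for a value (an assumed definition of "obvious")."""
    return str(int(value)) if is_integer else repr(float(value))


def make_number_token(text):
    # Simplification: a token is an integer if it has no "." or exponent.
    is_integer = not any(c in text for c in ".eE")
    value = float(text)
    # Keep the original text only when it differs from the obvious form.
    stored = None if text == obvious_form(value, is_integer) else text
    return NumberToken(value, is_integer, stored)


def representation_of(token):
    """Helper that hides whether the representation was stored."""
    if token.representation is not None:
        return token.representation
    return obvious_form(token.value, token.is_integer)
```

With this sketch, make_number_token("123") stores nothing, while
"000123", "+1", and "2e2" each keep their original text, and
representation_of() returns the source text in every case.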

#2 isn't so great. It means we're expanding the syntax of <urange>,
something dbaron didn't want to do in the first place, and it
increases the cost of numeric tokens anyway, as you have to remember
scinot exponents.  I don't think this wins us much.
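
The scinot collision is easy to check.  This short Python sketch uses
float() and int(text, 16) as stand-ins for number parsing and hex
interpretation (an assumption for illustration only):

```python
texts = ["200", "200e0", "20e1", "2e2"]

# Parsed as numbers, all four spellings collapse to the same value.
values = [float(t) for t in texts]

# Read as hex digits, as <urange> would need to, they are all distinct.
code_points = [int(t, 16) for t in texts]

print(values)       # [200.0, 200.0, 200.0, 200.0]
print(code_points)  # [512, 131296, 8417, 738]
```

Without the exponent (or the full representation) recorded on the
token, there is no way to get from the first list back to the second.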

#3 means that the unicode-range syntax infects Selectors, and any
future syntax we create that might have a + sign in it.  (An+B avoids
it, since the only letter allowed is "n", and calc() avoids it by
requiring whitespace around the +, but we *almost* resolved to remove
the whitespace requirement, which would have put this back into the
realm of possibility once we allowed keywords in calc().)

~TJ
Received on Wednesday, 19 November 2014 20:29:12 UTC