- From: Tab Atkins Jr. <jackalmage@gmail.com>
- Date: Wed, 19 Nov 2014 12:28:24 -0800
- To: www-style list <www-style@w3.org>
In the telcon today, dbaron expressed concern that the definition of <urange> requires looking at the "representation" of <number-token>s and <dimension-token>s. (The "representation" of a numeric token is the actual text used to write the number, including leading 0s, leading + sign, original base and exponent when using scientific notation, etc.)

I pointed out that storing the representation of numeric tokens is already required, in order to implement the <quirky-color> production from the Quirks Mode spec <https://quirks.spec.whatwg.org/#the-hashless-hex-color-quirk>. IE's behavior distinguishes between "color: 123;" and "color: 000123;", but FF/WK/Blink don't; both are treated as #000123, so we can maybe change the Quirks Mode spec to not require the representation.

So, that leaves us with three possible resolutions to the <urange> thing.

1. Leave it as it is. This requires storing the representation on every numeric token, which is a memory cost, but it lets us parse <urange> precisely. (The cost might not be as bad as all that. If you only store the representation when it's "non-obvious" (leading + sign, leading 0, scinot) then the memory cost is *most* of the time just a single null pointer per numeric token. You can regenerate the representation on the fly from "obvious" forms, so a helper function can be used to make representation-retrieval easy when it's necessary.)

2. Drop the representation requirement, and rejigger the <urange> definition to account for that. This has a few side effects:

   1) We can no longer limit the urange syntax to at most 6 hex digits per component; arbitrary numbers of leading 0s will be allowed and are impossible to detect. This just means that U+0000000 becomes valid, for example.

   2) Four of the six grammar clauses "eat" the plus sign in the following numeric token, and it's not detectable from the value that a plus sign was ever used.
      The fact that whitespace is disallowed makes this not a huge deal; in order to still hit the right token patterns, you need to do some stupid comment tricks. "U/**/0001" will technically become valid, and equivalent to "U+0001".

   3) Scinot is still a problem. "200", "200e0", "20e1", and "2e2" all produce the same value when parsed as a <number-token>, but obviously refer to four different codepoints when interpreted as hex values. Numeric tokens would have to record if they were in scinot form, and what the exponent was.

3. Revert this whole thing, and restore <unicode-range-token>. This requires us to fix the original problem some other way. As a refresher, the original issue was that "u+a { ... }" is a syntax error, as the selector is a <unicode-range-token>, not <ident-token>, +, <ident-token> like the author meant. Handling this in Selectors requires us to essentially "retokenize" selectors, to turn *some* <unicode-range-token>s into the expected token patterns; this would have to be repeated for any other syntax that ends up allowing something that looks like a unicode-range. It also means that non-CSS implementations of Selectors have to do some silly back-and-forth where they tokenize some strings into (meaningless) unicode-range tokens and then immediately re-tokenize them back into useful stuff.

I prefer solution #1 - doing it well increases the memory footprint of a numeric token by the size of a pointer (generally doubling the size of a <number-token>, but increasing the size of a <dimension-token> by somewhat less), and allows us to handle <urange> exactly, without a bunch of crazy hacks.

#2 isn't so great. It means we're expanding the syntax of <urange>, something dbaron didn't want to do in the first place, and it increases the cost of numeric tokens anyway, as you have to remember scinot exponents. I don't think this wins us much.
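
To make the trade-off between #1 and #2 concrete, here is a minimal Python sketch of the "store the representation only when it's non-obvious" idea, together with the scinot case that breaks a value-only token. The names NumberToken, format_obvious, make_token and get_representation are hypothetical, made up for illustration; they come from no spec or engine. The choice of what counts as the "obvious" spelling is likewise an assumption, and float() merely stands in for a real CSS number parser.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NumberToken:
    """Hypothetical numeric token; not taken from any real CSS engine."""
    value: float
    # Source text, kept only when it cannot be regenerated from `value`
    # (leading "+", leading 0, scientific notation). None otherwise.
    representation: Optional[str] = None


def format_obvious(value: float) -> str:
    """Assumed canonical spelling of a number (illustrative only)."""
    return str(int(value)) if value == int(value) else repr(value)


def make_token(text: str) -> NumberToken:
    """Build a token, storing the text only when it is non-obvious."""
    value = float(text)  # stand-in for a real CSS number parser
    obvious = text == format_obvious(value)
    return NumberToken(value, None if obvious else text)


def get_representation(token: NumberToken) -> str:
    """Return the source text, whether it was stored or is regenerated."""
    if token.representation is not None:
        return token.representation
    return format_obvious(token.value)


# The four spellings from the scinot example: one numeric value,
# but four different codepoints when the text is read as hex digits.
forms = ["200", "200e0", "20e1", "2e2"]
tokens = [make_token(text) for text in forms]
distinct_values = {token.value for token in tokens}
codepoints = [int(get_representation(token), 16) for token in tokens]
```

With this layout only "200" carries a null representation; the other three keep their text, which is the only way to tell 0x200, 0x200e0, 0x20e1 and 0x2e2 apart once the value has collapsed to 200.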

#3 means that the unicode-range syntax infects Selectors, and any future syntax we create that might have a + sign in it. (An+B avoids it, since the only letter allowed is "n", and calc() avoids it by requiring whitespace around the +, but we *almost* resolved to remove the whitespace requirement, which would have put this back into the realm of possibility once we allowed keywords in calc().)

~TJ
Received on Wednesday, 19 November 2014 20:29:12 UTC