- From: Tab Atkins Jr. via GitHub <sysbot+gh@w3.org>
- Date: Sat, 02 Feb 2019 00:19:05 +0000
- To: public-css-archive@w3.org
An earlier thread:

**Tab Atkins said:**

> In the telcon today, dbaron expressed concern that the definition of
> <urange> requires looking at the "representation" of <number-token>s
> and <dimension-token>s. (The "representation" of a numeric token is
> the actual text used to write the number, including leading 0s,
> leading + sign, original base and exponent when using scientific
> notation, etc.)
>
> I pointed out that storing the representation of numeric tokens is
> already required, in order to implement the <quirky-color> production
> from the Quirks Mode spec
> <https://quirks.spec.whatwg.org/#the-hashless-hex-color-quirk>. IE's
> behavior distinguishes between "color: 123;" and "color: 000123;", but
> FF/WK/Blink don't; both are treated as #000123, so we can maybe change
> the Quirks Mode spec to not require the representation.
>
> So, that leaves us with three possible resolutions to the <urange> thing.
>
> 1. Leave it as it is. This requires storing the representation on
> every numeric token, which is a memory cost, but it lets us parse
> <urange> precisely. (The cost might not be as bad as all that. If
> you only store the representation when it's "non-obvious" (leading +
> sign, leading 0, scinot) then the memory cost is *most* of the time
> just a single null pointer per numeric token. You can regenerate the
> representation on the fly from "obvious" forms, so a helper function
> can be used to make representation-retrieval easy when it's
> necessary.)
>
> 2. Drop the representation requirement, and rejigger the <urange>
> definition to account for that. This has a few side effects:
>     1) We can no longer limit the urange syntax to at most 6 hex
> digits per component; arbitrary numbers of leading 0s will be allowed
> and are impossible to detect. This just means that U+0000000 becomes
> valid, for example.
>     2) Four of the six grammar clauses "eat" the plus sign in the
> following numeric token, and it's not detectable from the value that a
> plus sign was ever used. The fact that whitespace is disallowed makes
> this not a huge deal; in order to still hit the right token patterns,
> you need to do some stupid comment tricks. "U/**/0001" will
> technically become valid, and equivalent to "U+0001".
>     3) Scinot is still a problem. "200", "200e0", "20e1", and "2e2"
> all produce the same value when parsed as a <number-token>, but
> obviously refer to four different codepoints when interpreted as hex
> values. Numeric tokens would have to record if they were in scinot
> form, and what the exponent was.
>
> 3. Revert this whole thing, and restore <unicode-range-token>. This
> requires us to fix the original problem some other way. As a
> refresher, the original issue was that "u+a { ... }" is a syntax
> error, as the selector is a <unicode-range-token>, not <ident-token>,
> +, <ident-token> like the author meant. Handling this in Selectors
> requires us to essentially "retokenize" selectors, to turn *some*
> <unicode-range-token>s into the expected token patterns; this would
> have to be repeated for any other syntax that ends up with allowing
> something looking like a unicode-range. It also means that non-CSS
> implementations of Selectors have to do some silly back-and-forth
> where they tokenize some strings into (meaningless) unicode-range
> tokens and then immediately re-tokenize them back into useful stuff.
>
> I prefer solution #1 - doing it well increases the memory footprint of
> a numeric token by the size of a pointer (generally doubling the size
> of a <number-token>, but increasing the size of a <dimension-token> by
> somewhat less), and allows us to handle <urange> exactly, without a
> bunch of crazy hacks.
>
> #2 isn't so great. It means we're expanding the syntax of <urange>,
> something dbaron didn't want to do in the first place, and it
> increases the cost of numeric tokens anyway, as you have to remember
> scinot exponents. I don't think this wins us much.
>
> #3 means that the unicode-range syntax infects Selectors, and any
> future syntax we create that might have a + sign in it. (An+B avoids
> it, since the only letter allowed is "n", and calc() avoids it by
> requiring whitespace around the +, but we *almost* resolved to remove
> the whitespace requirement, which would have put this back into the
> realm of possibility once we allowed keywords in calc().)

-------------

**Zack Weinburg said:**

> Option 3a: Restore <unicode-range-token> but declare that it is only
> considered as a tokenization within @font-face { ... }, or even only
> within the unicode-range: descriptor within @font-face.
>
> I can't say that I *like* this, but that's because I am
> philosophically not a fan of special tokenizer productions that only
> apply in specific grammar contexts -- can anyone think of a
> *practical* problem? It's not any worse than unquoted url() in terms
> of code, it can't change the boundaries of a top-level construct, and
> the only other issue that comes to mind is that it'll make it harder
> to use <unicode-range-token> somewhere else in the future. But I
> don't know that there *are* other uses, so.

-------------

**Tab Atkins said:**

> That requires a vastly more complicated change, switching the Syntax
> module from being separate tokenizer/parser steps to being integrated,
> with a lot more state being thrown around. And it doesn't help us if
> we ever want to use <urange> in another property or context, which I
> think is plausible.

--------------

**L. David Baron said:**

> > 1. Leave it as it is. This requires storing the representation on
> > every numeric token, which is a memory cost, but it lets us parse
> > <urange> precisely. (The cost might not be as bad as all that. If
> > you only store the representation when it's "non-obvious" (leading +
> > sign, leading 0, scinot) then the memory cost is *most* of the time
> > just a single null pointer per numeric token. You can regenerate the
> > representation on the fly from "obvious" forms, so a helper function
> > can be used to make representation-retrieval easy when it's
> > necessary.)
>
> I'm ok with this, and I think I prefer it at this point.

-- 
GitHub Notification of comment by tabatkins

Please view or discuss this issue at https://github.com/w3c/csswg-drafts/issues/3588#issuecomment-459912056 using your GitHub account
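
For readers of the archive, here is a rough sketch of the "only store the representation when it's non-obvious" idea from option 1 in the quoted thread, together with the scinot problem from option 2. Everything in it is an assumption made for illustration: `NumberToken`, `make_token`, and `as_hex_codepoint` are hypothetical names, not structures from the Syntax spec or from any browser, and Python's `float()` stands in for a real CSS number tokenizer (it accepts some strings CSS would not). "Non-obvious" is approximated here as "the text regenerated from the value does not match the source text".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NumberToken:
    """Hypothetical numeric token, for illustration only."""
    value: float
    # None in the common case; set only when the source text cannot be
    # regenerated from the value alone (leading +, leading 0s, scinot, ...).
    representation: Optional[str] = None

    def text(self) -> str:
        """Helper that returns the original source text of the number."""
        if self.representation is not None:
            return self.representation
        # "Obvious" form: regenerate the text from the value on the fly.
        if self.value.is_integer():
            return str(int(self.value))
        return repr(self.value)


def make_token(source: str) -> NumberToken:
    """Build a token, keeping the source text only when it is needed."""
    token = NumberToken(float(source))  # float() is a stand-in tokenizer
    if token.text() != source:
        # Regeneration would lose information, so keep the original text.
        token.representation = source
    return token


def as_hex_codepoint(token: NumberToken) -> int:
    """Reinterpret the token's source text as hex digits, as <urange> must."""
    return int(token.text().lstrip("+"), 16)


if __name__ == "__main__":
    for source in ["200", "200e0", "20e1", "2e2", "000123", "+1"]:
        tok = make_token(source)
        stored = "stored" if tok.representation is not None else "regenerated"
        print(f"{source!r}: value={tok.value}, text {stored}, "
              f"hex U+{as_hex_codepoint(tok):04X}")
```

In this sketch, "200", "200e0", "20e1", and "2e2" all carry the value 200 but map to four different hex codepoints, which is why the value alone is not enough to parse <urange>. Only the plain "200" gets by without a stored representation.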
Received on Saturday, 2 February 2019 00:19:07 UTC