Re: [csswg-drafts] [css-syntax] Urange and its problems (#3588) from Tab Atkins Jr. via GitHub on 2019-02-02 (public-css-archive@w3.org from February 2019)

From: Tab Atkins Jr. via GitHub <sysbot+gh@w3.org>
Date: Sat, 02 Feb 2019 00:19:05 +0000
To: public-css-archive@w3.org
Message-ID: <issue_comment.created-459912056-1549066744-sysbot+gh@w3.org>
An earlier thread:

**Tab Atkins said:**

> In the telcon today, dbaron expressed concern that the definition of
> <urange> requires looking at the "representation" of <number-token>s
> and <dimension-token>s.  (The "representation" of a numeric token is
> the actual text used to write the number, including leading 0s,
> leading + sign, original base and exponent when using scientific
> notation, etc.)
> 
> I pointed out that storing the representation of numeric tokens is
> already required, in order to implement the <quirky-color> production
> from the Quirks Mode spec
> <https://quirks.spec.whatwg.org/#the-hashless-hex-color-quirk>.  IE's
> behavior distinguishes between "color: 123;" and "color: 000123;", but
> FF/WK/Blink don't; both are treated as #000123, so we can maybe change
> the Quirks Mode spec to not require the representation.
> 
> So, that leaves us with three possible resolutions to the <urange> thing.
> 
> 1. Leave it as it is.  This requires storing the representation on
> every numeric token, which is a memory cost, but it lets us parse
> <urange> precisely.  (The cost might not be as bad as all that.  If
> you only store the representation when it's "non-obvious" (leading +
> sign, leading 0, scinot) then the memory cost is *most* of the time
> just a single null pointer per numeric token.  You can regenerate the
> representation on the fly from "obvious" forms, so a helper function
> can be used to make representation-retrieval easy when it's
> necessary.)
> 
> 2. Drop the representation requirement, and rejigger the <urange>
> definition to account for that.  This has a few side effects:
>     1) We can no longer limit the urange syntax to at most 6 hex
> digits per component; arbitrary numbers of leading 0s will be allowed
> and are impossible to detect.  This just means that U+0000000 becomes
> valid, for example.
>     2) Four of the six grammar clauses "eat" the plus sign in the
> following numeric token, and it's not detectable from the value that a
> plus sign was ever used.  The fact that whitespace is disallowed makes
> this not a huge deal; in order to still hit the right token patterns,
> you need to do some stupid comment tricks.  "U/**/0001" will
> technically become valid, and equivalent to "U+0001".
>     3) Scinot is still a problem.  "200", "200e0", "20e1", and "2e2"
> all produce the same value when parsed as a <number-token>, but
> obviously refer to four different codepoints when interpreted as hex
> values.  Numeric tokens would have to record if they were in scinot
> form, and what the exponent was.
> 
> 3. Revert this whole thing, and restore <unicode-range-token>.  This
> requires us to fix the original problem some other way.  As a
> refresher, the original issue was that "u+a { ... }" is a syntax
> error, as the selector is a <unicode-range-token>, not <ident-token>,
> +, <ident-token> like the author meant.  Handling this in Selectors
> requires us to essentially "retokenize" selectors, to turn *some*
> <unicode-range-token>s into the expected token patterns; this would
> have to be repeated for any other syntax that ends up with allowing
> something looking like a unicode-range.  It also means that non-CSS
> implementations of Selectors have to do some silly back-and-forth
> where they tokenize some strings into (meaningless) unicode-range
> tokens and then immediately re-tokenize them back into useful stuff.
> 
> 
> 
> I prefer solution #1 - doing it well increases the memory footprint of
> a numeric token by the size of a pointer (generally doubling the size
> of a <number-token>, but increasing the size of a <dimension-token> by
> somewhat less), and allows us to handle <urange> exactly, without a
> bunch of crazy hacks.
> 
> #2 isn't so great. It means we're expanding the syntax of <urange>,
> something dbaron didn't want to do in the first place, and it
> increases the cost of numeric tokens anyway, as you have to remember
> scinot exponents.  I don't think this wins us much.
> 
> #3 means that the unicode-range syntax infects Selectors, and any
> future syntax we create that might have a + sign in it.  (An+B avoids
> it, since the only letter allowed is "n", and calc() avoids it by
> requiring whitespace around the +, but we *almost* resolved to remove
> the whitespace requirement, which would have put this back into the
> realm of possibility once we allowed keywords in calc().)
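
To illustrate side effect 3 of option 2 above, here is a minimal sketch of why the parsed value alone cannot recover a urange. It uses Python's `float()` as a rough stand-in for <number-token> value parsing and `int(text, 16)` as a stand-in for reading the same text as hex digits; both are assumptions made for illustration, not the algorithms from the Syntax spec.

```python
# Illustrative only: compare the numeric value of each literal with the
# code point its raw text would denote if read as hexadecimal digits.
literals = ["200", "200e0", "20e1", "2e2"]

# float() stands in for <number-token> value parsing (an assumption).
numeric_values = {text: float(text) for text in literals}

# int(text, 16) stands in for interpreting the source text as hex digits.
hex_code_points = {text: int(text, 16) for text in literals}

# All four literals collapse to one numeric value...
assert len(set(numeric_values.values())) == 1

# ...but they name four different code points when read as hex.
assert len(set(hex_code_points.values())) == 4
```

Once only the value is kept, the four spellings are indistinguishable, which is why option 2 would still need numeric tokens to record the scinot form and exponent.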

-------------

**Zack Weinberg said:**

> Option 3a: Restore <unicode-range-token> but declare that it is only
> considered as a tokenization within @font-face { ... }, or even only
> within the unicode-range: descriptor within @font-face.
> 
> I can't say that I *like* this, but that's because I am
> philosophically not a fan of special tokenizer productions that only
> apply in specific grammar contexts -- can anyone think of a
> *practical* problem?  It's not any worse than unquoted url() in terms
> of code, it can't change the boundaries of a top-level construct, and
> the only other issue that comes to mind is that it'll make it harder
> to use <unicode-range-token> somewhere else in the future.  But I
> don't know that there *are* other uses, so.

-------------

**Tab Atkins said:**

> That requires a vastly more complicated change, switching the Syntax
> module from being separate tokenizer/parser steps to being integrated,
> with a lot more state being thrown around.  And it doesn't help us if
> we ever want to use <urange> in another property or context, which I
> think is plausible.

--------------

**L. David Baron said:**

> > 1. Leave it as it is.  This requires storing the representation on
> > every numeric token, which is a memory cost, but it lets us parse
> > <urange> precisely.  (The cost might not be as bad as all that.  If
> > you only store the representation when it's "non-obvious" (leading +
> > sign, leading 0, scinot) then the memory cost is *most* of the time
> > just a single null pointer per numeric token.  You can regenerate the
> > representation on the fly from "obvious" forms, so a helper function
> > can be used to make representation-retrieval easy when it's
> > necessary.)
> 
> I'm ok with this, and I think I prefer it at this point.
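
A rough sketch of the "store it only when it's non-obvious" idea from option 1 might look like the following. The `NumberToken` class, its fields, and `make_number_token` are hypothetical names invented for this example, not taken from any browser engine or from the spec, and `float()` is again only a stand-in for real CSS number parsing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NumberToken:
    """Hypothetical numeric token; not modeled on any real engine."""
    value: float
    is_integer: bool
    # Stays None (the "single null pointer") when the source text can be
    # regenerated from the value alone.
    stored_representation: Optional[str] = None

    def representation(self) -> str:
        """Helper that returns the original text, regenerating it if needed."""
        if self.stored_representation is not None:
            return self.stored_representation
        return str(int(self.value)) if self.is_integer else repr(self.value)


def make_number_token(text: str) -> NumberToken:
    # Assumption: a literal is an integer when it has no "." and no exponent.
    is_integer = not any(ch in text for ch in ".eE")
    token = NumberToken(value=float(text), is_integer=is_integer)
    # Keep the raw text only when regeneration would not round-trip it
    # (leading + sign, leading zeros, scientific notation, and so on).
    if token.representation() != text:
        token.stored_representation = text
    return token


plain = make_number_token("123")
padded = make_number_token("000123")
```

Here `plain` carries no stored text, while `padded` keeps "000123", so a <urange> or <quirky-color> parser could still ask either token for its exact source spelling through the same helper.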

-- 
GitHub Notification of comment by tabatkins
Please view or discuss this issue at https://github.com/w3c/csswg-drafts/issues/3588#issuecomment-459912056 using your GitHub account
Received on Saturday, 2 February 2019 00:19:07 UTC