Re: [css-syntax] Reverting <unicode-range> changes from CSS 2.1 from Tab Atkins Jr. on 2013-09-02 (www-style@w3.org from September 2013)

From: Tab Atkins Jr. <jackalmage@gmail.com>
Date: Mon, 2 Sep 2013 00:57:29 -0700
To: Simon Sapin <simon.sapin@exyr.org>
Cc: www-style <www-style@w3.org>, John Daggett <jdaggett@mozilla.com>
Message-ID: <CAAWBYDC6WU0ZXMiYMaz3YsXMD8_VnpzPNpyyKoozjB6oiUK3MQ@mail.gmail.com>

On Sun, Sep 1, 2013 at 2:44 PM, Simon Sapin <simon.sapin@exyr.org> wrote:
> Le 01/09/2013 18:06, Tab Atkins Jr. a écrit :
>> The second bullet point just doesn't make any sense, though, unless
>> we're worried about accidental usages of the token in generic contexts
>> like custom properties.
>
> I don’t think this affects custom properties, whose value is either
> (eventually) used in a non-custom property, or serialized back to a CSS
> string.

What I meant was that if we reject some types of things that look like
a unicode-range, and you accidentally use that particular sequence of
characters in a custom property, it won't round-trip serialize.  For
example, the text you removed from Syntax would have parsed "U+999999"
to a <unicode-range> token with an empty range.  I'm not actually sure
how I intended an empty range to be serialized, but it probably
wouldn't be as "U+999999". "U+1000-999999" would probably have been
serialized as "U+1000-10FFFF", since that's what its range would be
normalized to.
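
A minimal sketch of the normalization described above, assuming a range is held as a (start, end) pair of integers. The names normalize and serialize are hypothetical and are not taken from the Syntax draft, and the empty-range case is deliberately left without a serialization because the message leaves that open.

```python
MAX_CODE_POINT = 0x10FFFF  # highest Unicode code point


def normalize(start, end):
    """Clamp a range to the Unicode code space.

    Returns None when the whole range lies above U+10FFFF (an "empty" range).
    Decreasing ranges are passed through untouched in this sketch.
    """
    if start > MAX_CODE_POINT:
        return None
    return (start, min(end, MAX_CODE_POINT))


def serialize(rng):
    """Serialize a normalized range; an empty range has no defined form here."""
    if rng is None:
        return None
    start, end = rng
    if start == end:
        return f"U+{start:X}"
    return f"U+{start:X}-{end:X}"


# "U+999999" lies entirely outside Unicode, so nothing is left to serialize.
empty = normalize(0x999999, 0x999999)
# "U+1000-999999" is clamped, so the text that comes back out differs from the input.
clamped = serialize(normalize(0x1000, 0x999999))
```

Under these assumptions neither input round-trips: the first has no serialization at all, and the second comes back with its end clamped.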

(Though, even for valid ranges, like "U+2??", I'd probably have it not
round-trip, since it would serialize into "U+200-2FF".  So that's
probably not actually a big deal.)

>> I doubt we'll *ever* give "U+1?5-300" a valid
>> meaning, because it's a nonsensical range.  As I argued previously at
>> the face-to-face, the only reason that these silly kinds of ranges
>> were *ever* valid is because someone valued terseness over accuracy
>> when writing the regex - it's trivial to make a slightly longer regex
>> that only matches the ranges with sensical syntax.
>
> Actually, the Fonts spec now rejects many more corner cases than it used to
> (eg. decreasing range.) It’s much easier IMO to say "drop the declaration"
> than to try to encode all of these constraints in the tokenizer.
>
> Yes, we could make the token definition more restrictive (and less silly)
> but I think that the added complexity does not buy us anything.

Yeah, as I said (though perhaps not clearly enough), I'm fine with
removing the additional checks that Syntax did to verify that the
token "made sense".  I'm okay with pushing at least that much to the
individual specs that use the token.  (Not happy, but okay with it.)

What I'm against is forcing every use of the token to define how to
*parse* it, and how to reject nonsensical tokens like "U+1?5-300".  That
particular sequence of characters will *never* be a valid
unicode-range, no matter what we do, or what type of error-recovery a
particular property ends up wanting to define.

In other words:

* "U+9-1" is okay - let's keep that valid at the Syntax level, and let
Fonts deal with it as it wishes.
* "U+1?5" is not okay - let's reject that early, because we know for
certain that it's wrong.
* "U+???" should be transformed into "U+000-FFF" at the Syntax level,
because that's the way it'll *always* be interpreted, and we shouldn't
force every usage of the token to re-define how to parse it.  We
should just ensure that every unicode-range is turned into a start
value and an optional end value, with both values being non-negative
integers.
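
The three cases above can be sketched as a single parse step. This is only an illustration, not the tokenizer text from any draft: the function name parse_unicode_range, the two regexes, and the six-character cap are assumptions. The "?" wildcard is read as hexadecimal (0 for the start value, F for the end value), as CSS 2.1 defines it.

```python
import re

HEX = r"[0-9a-fA-F]"
# Form 1: "U+" hex digits, optionally "-" and more hex digits (e.g. U+9-1).
_EXPLICIT = re.compile(rf"[uU]\+({HEX}{{1,6}})(?:-({HEX}{{1,6}}))?")
# Form 2: "U+" optional hex digits followed only by trailing "?" (e.g. U+2??).
_WILDCARD = re.compile(rf"[uU]\+({HEX}*)(\?+)")


def parse_unicode_range(text):
    """Return (start, end_or_None), or None if the syntax is nonsensical."""
    m = _EXPLICIT.fullmatch(text)
    if m:
        start = int(m.group(1), 16)
        end = int(m.group(2), 16) if m.group(2) else None
        # No start <= end check: that is left to the spec using the token.
        return (start, end)
    m = _WILDCARD.fullmatch(text)
    if m and len(m.group(1)) + len(m.group(2)) <= 6:
        prefix, marks = m.group(1), m.group(2)
        start = int(prefix + "0" * len(marks), 16)
        end = int(prefix + "F" * len(marks), 16)
        return (start, end)
    # Anything else, such as a "?" followed by a digit, is rejected early.
    return None


decreasing = parse_unicode_range("U+9-1")     # kept: (9, 1)
nonsense = parse_unicode_range("U+1?5")       # rejected: None
wildcards = parse_unicode_range("U+???")      # expanded: (0x000, 0xFFF)
```

Every accepted input comes out as a start value and an optional end value, so a consumer such as Fonts only has to decide what to do with the numbers (for example, dropping a decreasing range).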

>> By making Syntax "agnostic" about this, we end up requiring every
>> usage of the token to repeat the exact same parsing/validation logic
>> every time.  This is silly, when we can just bake that in once at the
>> Syntax level,
>
>
> No need to repeat. If we ever need ranges of code points again, the new
> feature can refer to the parsing defined in the Fonts spec. If appropriate,
> we could then move it to the Values & Units spec.

Unless we think there's the faintest possibility of "U+1?5" ever being
considered valid, we should go ahead and do the parsing in the
*parsing* spec.  ^_^

>> unless we really do think accidental usages of that
>> character pattern are something to worry about.
>
>
> I’m not worried about that. All definitions of <unicode-range> we ever had
> require it to start with [uU]+[0-9a-fA-F?], which is pretty characteristic.

Still, though, that character pattern could show up in a base64 value
put directly in a custom property - if it was preceded by a delim
character, it'll parse correctly.  ^_^

~TJ
Received on Monday, 2 September 2013 07:58:16 UTC