
Re: [css-syntax] Reverting <unicode-range> changes from CSS 2.1

From: Simon Sapin <simon.sapin@exyr.org>
Date: Mon, 02 Sep 2013 09:52:09 +0100
Message-ID: <522451B9.3010505@exyr.org>
To: "Tab Atkins Jr." <jackalmage@gmail.com>
CC: www-style <www-style@w3.org>, John Daggett <jdaggett@mozilla.com>
On 02/09/2013 08:57, Tab Atkins Jr. wrote:
>>> I doubt we'll *ever* give "U+1?5-300" a valid
>>> meaning, because it's a nonsensical range.  As I argued previously at
>>> the face-to-face, the only reason that these silly kinds of ranges
>>> were *ever* valid is because someone valued terseness over accuracy
>>> when writing the regex - it's trivial to make a slightly longer regex
>>> that only matches the ranges with sensical syntax.
>> Actually, the Fonts spec now rejects many more corner cases than it used to
>> (eg. decreasing range.) It’s much easier IMO to say "drop the declaration"
>> than to try to encode all of these constraints in the tokenizer.
>> Yes, we could make the token definition more restrictive (and less silly)
>> but I think that the added complexity does not buy us anything.
> Yeah, as I said (though perhaps not clearly enough), I'm fine with
> removing the additional checks that Syntax did to verify that the
> token "made sense".  I'm okay with pushing at least that much to the
> individual specs that use the token.  (Not happy, but okay with it.)
> What I'm against is forcing every use of the token to define how to
> *parse* it, and reject nonsensical tokens like "U+1?5-300".  That
> particular sequence of characters will *never* be a valid
> unicode-range, no matter what we do, or what type of error-recovery a
> particular property ends up wanting to define.
> In other words:
> * "U+9-1" is okay - let's keep that valid at the Syntax level, and let
> Fonts deal with it as it wishes.
> * "U+1?5" is not okay - let's reject that early, because we know for
> certain that it's wrong.
> * "U+???" should be transformed into "U+000-999" at the Syntax level,
> because that's the way it'll *always* be interpreted, and we shouldn't
> force every usage of the token to re-define how to parse a token.  We
> should just ensure that every unicode-range is turned into a start
> value and an optional end value, with both values being positive
> integers.
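For concreteness, here is how I read those three rules, as a rough Python sketch (the function name and regex are mine, not from any spec; the real tokenizer is not regex-based). The "?" wildcards fill with 0 for the start value and F for the end value, as in the Fonts definition of <urange>:

```python
import re

# Illustrative sketch of the proposed <unicode-range> tokenization:
#   * "U+1?5" is rejected: "?" may only appear as trailing wildcards.
#   * "U+???" yields the pair (0x000, 0xFFF).
#   * "U+9-1" stays a valid *token* (start=9, end=1); whether a
#     decreasing range is meaningful is left to the Fonts spec.
RANGE = re.compile(
    r"^[uU]\+"
    r"(?:(?P<start>[0-9a-fA-F]{1,6})(?:-(?P<end>[0-9a-fA-F]{1,6}))?"
    r"|(?P<wild>[0-9a-fA-F]{0,5}\?{1,6}))$"
)

def consume_unicode_range(text):
    """Return (start, end) as integers, or None if the token is invalid."""
    m = RANGE.match(text)
    if m is None:
        return None
    wild = m.group("wild")
    if wild is not None:
        if len(wild) > 6:  # at most six hex digits / wildcards in total
            return None
        return (int(wild.replace("?", "0"), 16),
                int(wild.replace("?", "F"), 16))
    start = int(m.group("start"), 16)
    end = int(m.group("end"), 16) if m.group("end") else start
    return start, end
```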

So, trying to interpret this, you’re proposing to keep "Consume a 
unicode-range token" as it was, but skip the "Set the unicode-range’s 
range" step. The token would have a start and an optional end that are 
both integers. (Or the end could be non-optional, and set to the start 
if not provided in the source.)

Is this correct?

In the ED just before my edits:


If the token’s model is two integers, I think the Fonts spec should be 
changed to define its <urange> in terms of those integers. The current 
definition is text-based, so it is more consistent with a token 
containing code points.

John, what do you think?

>>> By making Syntax "agnostic" about this, we end up requiring every
>>> usage of the token to repeat the exact same parsing/validation logic
>>> every time.  This is silly, when we can just bake that in once at the
>>> Syntax level,
>> No need to repeat. If we ever need ranges of code points again, the new
>> feature can refer to the parsing defined in the Fonts spec. If appropriate,
>> we could then move it to the Values & Units spec.
> Unless we think there's the faintest possibility of "U+1?5" ever being
> considered valid, we should go ahead and do the parsing in the
> *parsing* spec.  ^_^

I don’t think everything parsing-related *has* to be in the Syntax spec. 
We already have lots of parsing definitions in other specs, for individual 
properties, Selectors, etc.

In this case I still believe it doesn’t buy us anything, but I’m not 
against doing a bit more than CSS 2.1 in Syntax. See above.

>>> unless we really do think accidental usages of that
>>> character pattern are something to worry about.
>> I’m not worried about that. All definitions of <unicode-range> we ever had
>> require it to start with [uU]+[0-9a-fA-F?], which is pretty characteristic.
> Still, though, that character pattern could show up in a base64 value
> put directly in a custom property - if it was preceded by a delim
> character, it'll parse correctly.  ^_^

That’s a separate but interesting question. What can go wrong if authors 
expect random text to round-trip through Custom Properties parsing and 
serialization? (Not just with <unicode-range>.)
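Tab’s collision is easy to reproduce. A small demo (the bytes and the regex are my construction for this reply; the regex only loosely approximates the CSS 2.1 token pattern): the standard base64 alphabet contains '+', 'U', and all the hex letters, so a byte sequence can encode to text matching the pattern.

```python
import base64
import re

# Loose approximation of the CSS 2.1 <unicode-range> token pattern
# (illustrative only; the real tokenizer is not regex-based):
UNICODE_RANGE = re.compile(r"[uU]\+[0-9a-fA-F?]{1,6}(?:-[0-9a-fA-F]{1,6})?")

# These three bytes were chosen by hand so that their base64
# encoding is exactly "U+3f", which matches the pattern:
blob = base64.b64encode(bytes([0x53, 0xED, 0xDF])).decode("ascii")
print(blob)  # "U+3f"
print(bool(UNICODE_RANGE.search(blob)))  # True
```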

Simon Sapin
Received on Monday, 2 September 2013 08:52:32 UTC
