[css2.1] [css3-fonts] Ambiguities relating to UNICODE-RANGE tokens from Zack Weinberg on 2008-09-24 (www-style@w3.org from September 2008)

From: Zack Weinberg <zweinberg@mozilla.com>
Date: Wed, 24 Sep 2008 16:24:15 -0700
To: www-style@w3.org
Message-ID: <20080924162415.6336bb4e@trurl>
There are a number of ambiguities in the specification of
unicode-range: descriptors and UNICODE-RANGE tokens.  Most are relevant
only to css3-fonts, but two relate to the core syntax and are therefore
relevant to css2.1 as well.

The regular expression defining UNICODE-RANGE in CSS2.1 is

  U\+[0-9a-f?]{1,6}(-[0-9a-f]{1,6})?

Core syntax issue 1 (editorial, one hopes): The initial U is in upper
case.  All other core lexical productions are written entirely in lower
case.  Section 4.1.3, bullet point 1, assures us that CSS is entirely
case-insensitive; I am assuming this is not a (unique) exception to that
rule.  For consistency, the U should be changed to lower case.  If it
*is* meant to be an exception, there should be explicit wording in
both css2.1 and css3-fonts that says so.
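
To make the case question concrete, here is a small Python sketch that
applies the production exactly as written, once case-sensitively and
once case-insensitively.  The variable names are illustrative only, and
Python's re module is just a stand-in for a Lex-style scanner.

```python
import re

# The production exactly as written in CSS2.1 (upper-case U, lower-case hex).
LITERAL = r"U\+[0-9a-f?]{1,6}(-[0-9a-f]{1,6})?"

# Reading 1: the production is taken literally (case-sensitive).
case_sensitive = re.compile(LITERAL)

# Reading 2: the production is subject to the general
# case-insensitivity rule, which is the reading assumed in this message.
case_insensitive = re.compile(LITERAL, re.IGNORECASE)


def matches(pattern, text):
    """True if the whole of text is a single token under pattern."""
    return pattern.fullmatch(text) is not None


for sample in ("U+00ff", "u+00ff", "U+00FF"):
    print(sample, matches(case_sensitive, sample),
          matches(case_insensitive, sample))
```

Under the literal reading only the first sample is a token, which is
presumably not what anyone intends.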

Possible core syntax issue 2: This regular expression will match
two classes of token which do not conform to any of the three
basic forms called out in the current ED of css3-fonts: 

  U+1?10      question marks are not (all) trailing
  U+A?-BF     both trailing question marks and a second endpoint

I believe it is not possible to exclude all tokens in these classes, and
still express all the existing constraints on UNICODE-RANGE tokens,
using only Lex-style regular expression productions; in particular, it
is not simultaneously possible to limit the first number to no
more than 6 characters and specify that all question marks must trail.

So I recommend that the core syntax be left alone here.  Instead,
css3-fonts should say that any UNICODE-RANGE token that does not fit
one of the three basic forms triggers a parse error (thus, the entire
descriptor is discarded).
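
A rough sketch of that two-stage check, in Python.  It assumes the three
basic forms are: a single number, a number with only trailing question
marks, and a pair of numbers.  It also assumes tokens are matched
case-insensitively.  The function names are hypothetical, not taken
from either spec.

```python
import re

# Stage 1: the core token, as defined in CSS2.1 (matched case-insensitively
# here, which is an assumption).
CORE = re.compile(r"u\+[0-9a-f?]{1,6}(-[0-9a-f]{1,6})?", re.IGNORECASE)

# Stage 2: the three basic forms (assumed shapes, see the lead-in).
SINGLE = re.compile(r"u\+[0-9a-f]{1,6}", re.IGNORECASE)
INTERVAL = re.compile(r"u\+[0-9a-f]{1,6}-[0-9a-f]{1,6}", re.IGNORECASE)
# Length is not limited here; the core pattern already caps it at six.
WILDCARD = re.compile(r"u\+[0-9a-f]*\?+", re.IGNORECASE)


def is_core_token(text):
    """True if text is a UNICODE-RANGE token under the core syntax."""
    return CORE.fullmatch(text) is not None


def is_basic_form(text):
    """True if text is a core token that also fits one of the three forms."""
    if not is_core_token(text):
        return False
    return any(p.fullmatch(text) for p in (SINGLE, INTERVAL, WILDCARD))


for sample in ("U+26", "U+00??", "U+0100-01FF", "U+1?10", "U+A?-BF"):
    print(sample, is_core_token(sample), is_basic_form(sample))
```

The last two samples are accepted by the tokenizer but rejected by the
second stage, which is where the parse error would be raised.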

[Aside: css3-fonts is almost entirely lacking in formal grammar rules.
It would be nice if they got added.]

----

There are several varieties of "semantically dubious" UNICODE-RANGE
tokens: while conforming to one of the three basic forms, their effect
is not fully specified.  These are issues only for css3-fonts, which
should specify the effect of all possible range tokens.

1) Nothing but question marks: U+???, U+????, etc.

These could be considered parse errors, or they could be treated as
zero-padded at the left, so equivalent to U+0???, U+0????, etc.  I have
no preference.
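
Under the zero-padding option, the expansion is mechanical: replace each
question mark with 0 for the low end and with F for the high end.  A
minimal Python sketch, assuming the token has already been validated
(expand_wildcard is a hypothetical name):

```python
def expand_wildcard(token):
    """Expand a trailing-question-mark token to an inclusive (lo, hi) pair.

    Assumes token starts with "U+" and has already been validated.
    Missing leading digits are implicitly zero, so "U+???" and "U+0???"
    come out the same.
    """
    body = token[2:]  # strip the "U+" prefix
    lo = int(body.replace("?", "0"), 16)
    hi = int(body.replace("?", "F"), 16)
    return lo, hi


print(expand_wildcard("U+???") == expand_wildcard("U+0???"))
```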

2) Redundant second number: U+0100-0100 e.g.

I recommend these be treated as equivalent to the single-number form,
U+0100 e.g.  (This would license a tool for generating @font-face blocks
to be "lazy" and write out all ranges in U+xxxx-yyyy format even when
the range is a single character.)

3) Descending range: U+00FF-0000 e.g.

These should either be ignored or be parse errors.  I have no
preference.
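
Cases 2 and 3 can be handled in one place.  The sketch below takes the
"ignore" option for descending ranges (returning None); raising a parse
error instead would be a one-line change.  parse_interval is a
hypothetical helper and assumes an already-validated two-number token.

```python
def parse_interval(token):
    """Parse "U+xxxx-yyyy" into an inclusive (lo, hi) pair of integers.

    Returns None for a descending range (the "ignore" option).
    A redundant second number simply yields lo == hi, i.e. the same
    result as the single-number form.
    """
    first, second = token[2:].split("-")  # strip "U+", split endpoints
    lo, hi = int(first, 16), int(second, 16)
    if lo > hi:
        return None
    return lo, hi


print(parse_interval("U+0100-0100"))
print(parse_interval("U+00FF-0000"))
```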

4a) Range that straddles the high end of Unicode: U+0F0000-12FFFF e.g.
4b) Range entirely outside Unicode: U+A00000-A0FFFF e.g.

I recommend that the former be clipped to the Unicode range and the
latter ignored, silently.  I also recommend a note saying that future
revisions of CSS may enlarge the range.  (I think the current limit of
U+10FFFF, a little over 2^20 code points, will one day not be enough ...
I'd actually like to see the core syntax allow eight digits in each
number, but we don't need to go there right now.)
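
The clip-or-ignore rule for 4a and 4b might look like this in Python.
The limit is a parameter so that a future enlargement of the range only
changes one constant; clip_range and UNICODE_MAX are illustrative names.

```python
UNICODE_MAX = 0x10FFFF  # current upper bound of the Unicode code space


def clip_range(lo, hi, limit=UNICODE_MAX):
    """Clip an inclusive (lo, hi) range to [0, limit].

    Returns None if the range lies entirely above the limit (case 4b),
    otherwise the range with its upper end clipped (case 4a).
    Assumes lo <= hi.
    """
    if lo > limit:
        return None
    return lo, min(hi, limit)


print(clip_range(0x0F0000, 0x12FFFF))  # straddles the high end
print(clip_range(0xA00000, 0xA0FFFF))  # entirely outside
```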

5) Overlapping or redundant multiple ranges:
     unicode-range: U+00??, U+0080-01FF;
     unicode-range: U+2A00-2AFF, U+2A34;

I recommend that the meaning of a unicode-range: descriptor be specified
to be the set-theoretic union of all the ranges listed; the above examples
would be equivalent to

   unicode-range: U+0000-01FF;
   unicode-range: U+2A??;

respectively.
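
Computing that union is a standard interval merge.  A short Python
sketch, assuming each range has already been reduced to an inclusive
(lo, hi) pair of integers (union_ranges is a hypothetical name):

```python
def union_ranges(ranges):
    """Merge inclusive (lo, hi) integer ranges into a minimal sorted list.

    Overlapping ranges are merged, and so are ranges that merely touch
    (e.g. 0x00-0xFF and 0x100-0x1FF), since they cover a contiguous set
    of code points.
    """
    merged = []
    for lo, hi in sorted(ranges):
        if merged and lo <= merged[-1][1] + 1:
            # Extend the previous range instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


# U+00??, U+0080-01FF
print(union_ranges([(0x0000, 0x00FF), (0x0080, 0x01FF)]))
# U+2A00-2AFF, U+2A34
print(union_ranges([(0x2A00, 0x2AFF), (0x2A34, 0x2A34)]))
```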

----

There is also a question of what text is produced by a CSSOM query for
the value of an arbitrary unicode-range: descriptor.  I recommend that
implementations be allowed, but not required, to produce a simplified
representation of the range instead of the original text.  Continuing
with the example of

   unicode-range: U+00??, U+0080-01FF;

an implementation should be allowed to produce (at least) any of these:

   U+00??, U+0080-01FF;      // exactly the original text
   U+0000-00FF, U+0080-01FF; // question marks expanded to pairs
   U+00??, U+01??;           // normalized to question mark form
   U+0000-00FF, U+0100-01FF; // normalized to pair form
   U+0000-01FF;              // optimized

I don't think the spec needs to enumerate possibilities; just mention
that implementations have license in this area.
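
For illustration, two of the serializations above can be produced from
a list of inclusive (lo, hi) integer pairs as follows.  This is only a
sketch of what an implementation might do; the function names and the
four-digit minimum width are my own choices, not anything the spec or
the CSSOM requires.

```python
def to_pair_form(ranges):
    """Serialize ranges as "U+xxxx-yyyy, ..." (pair form)."""
    return ", ".join(f"U+{lo:04X}-{hi:04X}" for lo, hi in ranges)


def to_wildcard_form(lo, hi):
    """Return the question-mark form of (lo, hi), or None if there is none.

    A range has a question-mark form only if it is an aligned block of
    16**k code points for some k >= 1.
    """
    for k in range(1, 6):
        size = 16 ** k
        if lo % size == 0 and hi == lo + size - 1:
            digits = max(4, len(f"{lo:X}"))
            prefix = f"{lo:0{digits}X}"[:-k]  # drop the k wildcard digits
            return "U+" + prefix + "?" * k
    return None


ranges = [(0x0000, 0x00FF), (0x0080, 0x01FF)]
print(to_pair_form(ranges))
print([to_wildcard_form(lo, hi) for lo, hi in ranges])
```

The second range in the example has no question-mark form, which is one
reason to leave the choice of representation to the implementation.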

I would be happy to come up with wording for any or all of the above
changes.

zw
Received on Wednesday, 24 September 2008 23:25:00 UTC