- From: Zack Weinberg <zweinberg@mozilla.com>
- Date: Wed, 24 Sep 2008 16:24:15 -0700
- To: www-style@w3.org
There are a number of ambiguities in the specification of unicode-range: descriptors and UNICODE-RANGE tokens. Most are relevant only to css3-fonts, but two relate to the core syntax and are therefore relevant to css2.1 as well.

The regular expression defining UNICODE-RANGE in CSS2.1 is

    U\+[0-9a-f?]{1,6}(-[0-9a-f]{1,6})?

Core syntax issue 1 (editorial, one hopes): The initial U is in upper case. All other core lexical productions are written entirely in lower case. 4.1.3 bullet point 1 assures us that CSS is entirely case-insensitive; I am assuming this is not a (unique) exception to that rule. For consistency, the U should be changed to lower case. If it *is* meant to be an exception, there should be explicit wording in both css-2.1 and css3-fonts that says so.

Possible core syntax issue 2: This regular expression will match two classes of token which do not conform to any of the three basic forms called out in the current ED of css3-fonts:

    U+1?10     question marks are not (all) trailing
    U+A?-BF    both trailing question marks and a second endpoint

I believe it is not possible to exclude all tokens in these classes, and still express all the existing constraints on UNICODE-RANGE tokens, using only Lex-style regular expression productions; in particular, it is not simultaneously possible to limit the first number to no more than 6 characters and specify that all question marks must trail. So I recommend that the core syntax be left alone here. Instead, css3-fonts should say that any UNICODE-RANGE token that does not fit one of the three basic forms triggers a parse error (thus, the entire descriptor is discarded).

[Aside: css3-fonts is almost entirely lacking in formal grammar rules. It would be nice if they got added.]

----

There are several varieties of "semantically dubious" UNICODE-RANGE tokens: while conforming to one of the three basic forms, their effect is not fully specified.
These are issues only for css3-fonts, which should specify the effect of all possible range tokens.

1) Nothing but question marks, e.g. U+???, U+????, etc.

   These could be considered parse errors, or they could be treated as zero-padded at the left, so equivalent to U+0???, U+0????, etc. I have no preference.

2) Redundant second number, e.g. U+0100-0100

   I recommend these be treated as equivalent to the single-number form, e.g. U+0100. (This would license a tool for generating @font-face blocks to be "lazy" and write out all ranges in U+xxxx-yyyy format even when the range is a single character.)

3) Descending range, e.g. U+00FF-0000

   These should either be ignored or be parse errors. I have no preference.

4a) Range that straddles the high end of Unicode, e.g. U+0F0000-12FFFF
4b) Range entirely outside Unicode, e.g. U+A00000-A0FFFF

   I recommend that the former be clipped to the Unicode range and the latter ignored, silently. I also recommend a note saying that future revisions of CSS may enlarge the range. (I think roughly 2^20 characters will, one day, not be enough ... I'd actually like to see the core syntax allow eight digits in each number, but we don't need to go there right now.)

5) Overlapping or redundant multiple ranges:

    unicode-range: U+00??, U+0080-01FF;
    unicode-range: U+2A00-2AFF, U+2A34;

   I recommend that the meaning of a unicode-range: descriptor be specified to be the set-theoretic union of all the ranges listed; the above examples would be equivalent to

    unicode-range: U+0000-01FF;
    unicode-range: U+2A??;

   respectively.

----

There is also a question of what text is produced by a CSSOM query for the value of an arbitrary unicode-range: descriptor. I recommend that implementations be allowed, but not required, to produce a simplified representation of the range instead of the original text.
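[Illustration: the handling recommended above for range tokens, and the set-theoretic union of item 5, can be sketched in a few lines of Python. This is only a sketch under stated assumptions, not proposed spec text. The names parse_token, union and parse_descriptor are hypothetical. Where the message expresses no preference, the sketch picks one option arbitrarily: all-question-mark tokens are zero-padded, and descending ranges are silently ignored. The input is the descriptor value only, without the "unicode-range:" prefix or trailing semicolon.]

```python
import re

MAX_CODE_POINT = 0x10FFFF  # current upper bound of Unicode

# The three basic forms, matched case-insensitively.
_SINGLE = re.compile(r"u\+([0-9a-f]{1,6})", re.IGNORECASE)
_WILDCARD = re.compile(r"u\+([0-9a-f]{0,5})(\?{1,6})", re.IGNORECASE)
_RANGE = re.compile(r"u\+([0-9a-f]{1,6})-([0-9a-f]{1,6})", re.IGNORECASE)


def parse_token(token):
    """Return (lo, hi) for a usable token, or None if it is ignored.

    Raises ValueError for tokens that fit none of the three basic forms
    (the "parse error" case, e.g. U+1?10 or U+A?-BF).
    """
    m = _RANGE.fullmatch(token)
    if m:
        lo, hi = int(m.group(1), 16), int(m.group(2), 16)
    else:
        m = _WILDCARD.fullmatch(token)
        if m and len(m.group(1)) + len(m.group(2)) <= 6:
            digits, marks = m.group(1), m.group(2)
            # Assumption: U+??? is zero-padded, i.e. treated like U+0???.
            lo = int(digits + "0" * len(marks), 16)
            hi = int(digits + "f" * len(marks), 16)
        else:
            m = _SINGLE.fullmatch(token)
            if not m:
                raise ValueError(f"not one of the three basic forms: {token!r}")
            lo = hi = int(m.group(1), 16)
    if lo > hi:
        return None  # descending range: ignored (assumption; could be an error)
    if lo > MAX_CODE_POINT:
        return None  # entirely outside Unicode: ignored
    return (lo, min(hi, MAX_CODE_POINT))  # straddling range: clipped


def union(ranges):
    """Merge overlapping or adjacent (lo, hi) pairs into a sorted minimal list."""
    merged = []
    for lo, hi in sorted(ranges):
        if merged and lo <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def parse_descriptor(value):
    """Parse a comma-separated list of range tokens into its union."""
    tokens = [t.strip() for t in value.split(",")]
    ranges = [r for r in map(parse_token, tokens) if r is not None]
    return union(ranges)
```

[With this sketch, parse_descriptor("U+00??, U+0080-01FF") yields the single range 0x0000 to 0x01FF, and parse_descriptor("U+2A00-2AFF, U+2A34") yields 0x2A00 to 0x2AFF, matching the equivalences given in item 5.]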
Continuing with the example of

    unicode-range: U+00??, U+0080-01FF;

an implementation should be allowed to produce (at least) any of these:

    U+00??, U+0080-01FF;        // exactly the original text
    U+0000-00FF, U+0080-01FF;   // question marks expanded to pairs
    U+00??, U+01??;             // normalized to question mark form
    U+0000-00FF, U+0100-01FF;   // normalized to pair form
    U+0000-01FF;                // optimized

I don't think the spec needs to enumerate possibilities; just mention that implementations have license in this area.

I would be happy to come up with wording for any or all of the above changes.

zw
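[Illustration: two of the serializations listed above, "pair form" and "question mark form", can be generated from a list of (lo, hi) code point pairs roughly as follows. This is a sketch only; the function names pair_form and question_mark_form are hypothetical, and the input is assumed to be a sorted list of non-overlapping ranges within U+0000 to U+10FFFF. The question-mark serializer always keeps at least one leading hex digit, so it never emits the all-question-mark tokens discussed in item 1.]

```python
def pair_form(ranges):
    """Serialize (lo, hi) pairs as U+xxxx-yyyy tokens."""
    return ", ".join(f"U+{lo:04X}-{hi:04X}" for lo, hi in ranges)


def question_mark_form(ranges):
    """Serialize (lo, hi) pairs using trailing question marks where possible.

    Each range is split greedily into aligned blocks of 16**k code points;
    a block with k == 0 is written as a plain single code point.
    """
    tokens = []
    for lo, hi in ranges:
        while lo <= hi:
            k = 0
            # Grow the block while it stays aligned and inside the range.
            # k is capped at 5 so a token never exceeds six characters.
            while (k < 5
                   and lo % 16 ** (k + 1) == 0
                   and lo + 16 ** (k + 1) - 1 <= hi):
                k += 1
            prefix = lo >> (4 * k)
            # Pad to four characters in total, keeping at least one hex digit.
            digits = f"{prefix:X}".rjust(max(1, 4 - k), "0")
            tokens.append("U+" + digits + "?" * k)
            lo += 16 ** k
    return ", ".join(tokens)
```

[For the merged range 0x0000 to 0x01FF, pair_form gives "U+0000-01FF" and question_mark_form gives "U+00??, U+01??", which correspond to the "optimized" and "normalized to question mark form" lines in the list above.]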
Received on Wednesday, 24 September 2008 23:25:00 UTC