Re: [CSS 2.1] Scope of identifier definition in Section 4.1.3

On Wednesday 12 March 2008 23:03, Benjamin Hawkes-Lewis wrote:
> The prose description of identifiers in the CSS 2.1 specification
> says:
> > In CSS, identifiers (including element names, classes, and IDs in
> > selectors) can contain only the characters [a-z0-9] and ISO 10646
> > characters U+00A1 and higher, plus the hyphen (-) and the
> > underscore (_); they cannot start with a digit, or a hyphen
> > followed by a digit. Identifiers can also contain escaped
> > characters and any ISO 10646 character as a numeric code (see next
> > item). For instance, the identifier "B&W?" may be written as
> > "B\&W\?" or "B\26 W\3F".
>
> http://www.w3.org/TR/CSS21/syndata.html#value-def-identifier
>
> The next definition begins:
> > In CSS 2.1, a backslash (\) character indicates three types of
> > character escapes.
>
> It would have been helpful to this reader, at least, if it were
> equally clear that the prose was talking only about identifiers in
> CSS 2.1 not "CSS" generally, where according to the tokenization
> rules identifiers may contain characters of octal 200 (U+0080) and
> higher (i.e. a substantially wider set):
>
> http://www.w3.org/TR/CSS21/syndata.html#tokenization

It seems you actually found an error in the text that nobody saw before, 
though the reason for the error is different from what you assumed. 
It's not a difference between generic CSS and CSS 2.1. The syntax of 
identifiers is meant to be the same in all levels.

The generic syntax for CSS is trying to say that nothing outside the 
ASCII range is ever going to have a special function in CSS. All 
punctuation (such as curly braces and semicolons) is taken from the 
ASCII range. Section 4.1.1 says it in octal: nonascii is everything 
above 0177 (127 in decimal). Section 4.1.3 says it in hexadecimal: A1 
and higher (161 in decimal).
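
To make the mismatch concrete, here is a minimal Python sketch that
compares the two lower bounds. The function names are illustrative
only; neither section of the spec defines them.

```python
# Sketch: compare the two "non-ASCII" lower bounds quoted above.
# Function names are illustrative and do not come from the spec.

def nonascii_4_1_1(cp: int) -> bool:
    """Section 4.1.1 reading: everything above octal 177 (decimal 127)."""
    return cp > 0o177

def nonascii_4_1_3(cp: int) -> bool:
    """Section 4.1.3 reading: U+00A1 (decimal 161) and higher."""
    return cp >= 0xA1

# Code points accepted by 4.1.1 but rejected by 4.1.3.
# Scanning the first 256 code points is enough to cover the gap.
gap = [cp for cp in range(0x100)
       if nonascii_4_1_1(cp) and not nonascii_4_1_3(cp)]

print(f"U+{gap[0]:04X} .. U+{gap[-1]:04X} ({len(gap)} code points)")
# prints "U+0080 .. U+00A0 (33 code points)"
```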

So are the Unicode characters between 127 and 161, i.e., U+0080
through U+00A0 (128 to 160 in decimal), allowed or not?

Well, when this text was first written, Unicode was still at version 1 
and there *were* no characters between 127 and 160. The first actual 
non-ASCII character was at 160 (the non-breakable space).

I think that's why section 4.1.3 says U+00A1 (161). It was just trying 
to be helpful, although I don't understand why it says A1 and not A0. 
The non-breakable space has no special function in CSS, so why exclude 
it?

Unicode is now at version 5 and it filled the gap between 127 and 160 
with actual characters. They are "control characters" like "cancel 
character" and "reverse line feed," i.e., not things that you can see 
or type in a typical editor, but a creative user could probably find a 
way to put them in a CSS file anyway. And thus section 4.1.3 needs to 
include them.

So I think we need this fix:

    In section 4.1.3, second bullet, replace "U+00A1" by "U+0080".


There is another point to your e-mail. You believed that there was a 
difference between generic CSS and CSS 2.1, because the third bullet 
point says "CSS 2.1" while the others say just "CSS."

I can see that that is confusing. I think we, the editors, too often 
read these texts one line at a time, to see if that line is correct on 
its own. But if you read the lines in sequence, they indeed *suggest* 
that there is one rule for CSS 2.1 and another for CSS in general.

That third bullet point is strictly speaking correct. The backslash 
works as described in CSS 2.1 and that's all that this spec needs to 
define. But it actually also works like that in other levels of CSS and 
it is less confusing if we say so. So I think we also should change:

    In section 4.1.3, third bullet, replace the first "CSS 2.1"
    by "CSS".


For reference: this issue will be tracked as issue 57 at
http://csswg.inkedblade.net/spec/css2.1#issue-57



Bert

PS. I tested what browsers do, and, unfortunately, it seems that Firefox 
(version 2.0.0.14) does what 4.1.3 says: U+0080 through U+00A0 cannot 
occur in identifiers. Opera and Konqueror do what 4.1.1 says: anything 
above U+007F can be in an identifier. Attached is a test case with a 
non-breakable space (U+00A0) and another with a reverse line feed 
(U+008D).

Let's hope they survive e-mail encoding and decoding...
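
In case the attachments get mangled, a test of this kind can be
regenerated from code points. The Python sketch below is not the
attached test case; the helper name, the class name and the style rule
are all hypothetical. It only shows how such a file might be built as
UTF-8 bytes so that the characters do not depend on what a mail client
does to them.

```python
# Sketch of a test stylesheet; NOT the attachment from this message.
# The helper name, class name and declaration are hypothetical.

def make_test_css(cp: int) -> bytes:
    """Return a UTF-8 rule whose class name contains the code point cp."""
    name = "a" + chr(cp) + "b"
    rule = "." + name + " { background: green }\n"
    return b'@charset "UTF-8";\n' + rule.encode("utf-8")

nbsp_css = make_test_css(0xA0)  # non-breakable space
rlf_css = make_test_css(0x8D)   # reverse line feed (a C1 control)

# Show the raw bytes so the encoded characters are visible.
print(nbsp_css)
print(rlf_css)
```

A user agent that follows 4.1.1 would be expected to read the class
name as a single identifier; one that follows 4.1.3 would not.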

-- 
  Bert Bos                                ( W 3 C ) http://www.w3.org/
  http://www.w3.org/people/bos                               W3C/ERCIM
  bert@w3.org                             2004 Rt des Lucioles / BP 93
  +33 (0)4 92 38 76 92            06902 Sophia Antipolis Cedex, France

Received on Thursday, 3 July 2008 18:14:39 UTC