[CSS21] Identifier prose (4.1.3)

Section 4.1.3 of the CSS 2.1 specification [D]:

"In CSS 2.1, identifiers (including element names, classes, and IDs in 
selectors) can contain only the characters [A-Za-z0-9] and ISO 10646 
characters 161 and higher, plus the hyphen (-) and the underscore (_); 
they cannot start with a hyphen or a digit. They can also contain 
escaped characters and any ISO 10646 character as a numeric code (see 
next item). For instance, the identifier "B&W?" may be written as 
"B\&W\?" or "B\26 W\3F".

"Note that Unicode is code-by-code equivalent to ISO 10646 (see 
[UNICODE] and [ISO10646])."

I assume, because a comprehensive formal grammar is available nearby, 
that this passage is primarily for non-technical readers. If so, 
there are a couple of problems. The construct "[A-Za-z0-9]" will look 
odd even if understood. The mention of "ISO 10646" is likely to mean 
nothing to many non-technical readers.

ID selectors are HASH tokens. HASH tokens may have an unescaped digit 
immediately following the number sign, as in {#1name}. If the prose is 
not meant to constrain the syntax, the prose requires a change. If the 
prose is meant to constrain the syntax beyond what the grammar allows, 
I have to ask: what special meaning would the Working Group (now or in 
the future) like to assign to HASH tokens that start with digits and 
that are in selectors? If no distinct meaning is intended, there is no 
point in constraining the syntax.
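
To make the mismatch concrete, here is a rough sketch in Python. The two patterns are simplified, ASCII-only approximations (escapes and non-ASCII characters are left out), and the names IDENT, NAME, is_ident and is_hash_token are illustrative assumptions, not anything taken from the specification. IDENT follows the prose rule quoted above; NAME follows the looser rule that the grammar applies after the number sign.

```python
import re

# Simplified, ASCII-only approximations; escapes and non-ASCII are omitted.
# IDENT models the prose: no leading digit and no leading hyphen.
IDENT = re.compile(r"[_A-Za-z][_A-Za-z0-9-]*")
# NAME models what may follow "#" in a HASH token: any run of name characters.
NAME = re.compile(r"[_A-Za-z0-9-]+")


def is_ident(text):
    """True if text is an identifier under the (simplified) prose rule."""
    return IDENT.fullmatch(text) is not None


def is_hash_token(text):
    """True if text is "#" followed by a (simplified) name."""
    return text.startswith("#") and NAME.fullmatch(text[1:]) is not None


if __name__ == "__main__":
    for sample in ("#1name", "#name1"):
        print(sample, is_hash_token(sample), is_ident(sample[1:]))
```

Under these assumptions "#1name" is a valid HASH token even though "1name" is not an identifier, which is the gap between grammar and prose described above.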

Referring to characters by decimal numbers is odd, given the prevalence 
and official use of hexadecimal notation. If intended as a sop to 
non-technical readers, I think that the effort is mostly wasted. When 
trying to discover what "characters 161 and higher" means in practical 
terms, non-technical readers will probably have to translate to 
hexadecimal numbers as they sift through, for example, the Unicode 
Consortium's documents.

The exclusion of the C1 control characters is not reflected in the 
grammar. Which is authoritative: prose or grammar? How do existing 
implementations handle the presence of C1 control characters?
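
The disputed range can be listed mechanically. The sketch below assumes that the grammar's non-ASCII character class begins at octal 200 (decimal 128), as I read the draft; the function names are illustrative only. It collects the code points below 256 that the grammar would accept but the prose ("161 and higher") would not.

```python
def allowed_by_prose(code_point):
    """Prose rule: non-ASCII characters must be 161 (0xA1) or higher."""
    return code_point >= 161


def allowed_by_grammar(code_point):
    """Assumed grammar rule: the non-ASCII class starts at octal 200 (128)."""
    return code_point >= 0o200


# Code points the grammar accepts but the prose excludes.
disputed = [cp for cp in range(128, 256)
            if allowed_by_grammar(cp) and not allowed_by_prose(cp)]

if __name__ == "__main__":
    print("first: U+%04X, last: U+%04X, count: %d"
          % (disputed[0], disputed[-1], len(disputed)))
```

Under that assumption the disputed set is U+0080 through U+00A0: the 32 C1 control characters plus U+00A0 (no-break space).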

The character called a hyphen in the passage is actually a 
hyphen-minus, U+002D. The true and unambiguous hyphen, U+2010, is 
distinct in the Universal Character Set. I mean, the One True Hyphen, 
praise be unto it.

There are many digits which may start identifiers. Examples include 
Thai, Arabic, Devanagari... The only digits excluded are the ones 
present in ASCII, 0–9 (U+0030–U+0039).
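
A quick check with Python's standard unicodedata module illustrates the point. The helper may_start_identifier_per_prose is a hypothetical encoding of the prose rule (ASCII letters, underscore, or any character numbered 161 and higher), not something the specification defines.

```python
import unicodedata

# One digit each from Thai, Arabic-Indic and Devanagari, plus ASCII "1".
SAMPLES = ["\u0E51", "\u0661", "\u0967", "1"]


def may_start_identifier_per_prose(ch):
    """Hypothetical reading of the prose rule for an unescaped first character."""
    return (ch == "_" or "A" <= ch <= "Z" or "a" <= ch <= "z"
            or ord(ch) >= 161)


if __name__ == "__main__":
    for ch in SAMPLES:
        print("U+%04X" % ord(ch),
              unicodedata.name(ch),
              unicodedata.category(ch),  # "Nd" means decimal digit
              may_start_identifier_per_prose(ch))
```

All four samples have the general category Nd (decimal digit), yet only the ASCII digit is barred from starting an identifier under this reading.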

That identifiers can contain "any ISO 10646 character as a numeric 
code" is true only because ISO 10646 has not yet assigned characters 
beyond U+FFFFFF. From its inception, ISO 10646 has had a range of 
codepoints up to U+7FFFFFFF. Current CSS, with a limit of six 
hexadecimal digits in numeric escapes, can represent only codepoints 
U+0000–U+FFFFFF. It would be best to specify the range as such or to 
say that CSS can represent any Unicode character (which would be 
U+0000–U+10FFFF).
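
The arithmetic behind the three limits can be sketched as follows. The constant names and the function css_numeric_escape are illustrative assumptions; the function simply writes a code point as a backslash, up to six hexadecimal digits and a terminating space, and refuses anything that would need a seventh digit.

```python
MAX_SIX_HEX = int("F" * 6, 16)       # largest value six hex digits can hold
UNICODE_MAX = 0x10FFFF               # last Unicode code point
ISO10646_ORIGINAL_MAX = 0x7FFFFFFF   # original ISO 10646 code space


def css_numeric_escape(code_point):
    """Write a code point as a numeric escape of at most six hex digits."""
    if not 0 <= code_point <= MAX_SIX_HEX:
        raise ValueError("needs more than six hexadecimal digits")
    # The trailing space terminates the escape unambiguously.
    return "\\%X " % code_point


if __name__ == "__main__":
    print(css_numeric_escape(0x26))              # ampersand
    print(UNICODE_MAX <= MAX_SIX_HEX)            # all of Unicode fits
    print(ISO10646_ORIGINAL_MAX <= MAX_SIX_HEX)  # the original range does not
```

Six digits cover all of Unicode but only a small part of the original ISO 10646 range, which is why the wording "any ISO 10646 character" is fragile.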

"B&W?" is not an identifier. It is a sequence of characters that an 
identifier using proper mechanisms can represent.
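
The distinction between the written form and the character sequence it stands for can be shown with a small decoder. This is a minimal sketch, assuming the escape syntax as described in the quoted passage (a backslash followed by one to six hexadecimal digits and optional trailing whitespace, or a backslash followed by a single non-hexadecimal character); the name unescape is hypothetical.

```python
import re

# Numeric form: backslash, 1-6 hex digits, one optional whitespace terminator.
# Direct form: backslash followed by any character that is neither a hex
# digit nor a newline or form feed.
_ESCAPE = re.compile(
    r"\\(?:([0-9A-Fa-f]{1,6})(?:\r\n|[ \t\r\n\f])?|([^\r\n\f0-9A-Fa-f]))"
)


def unescape(written):
    """Return the character sequence that a written identifier represents."""
    def replace(match):
        if match.group(1) is not None:
            # chr() raises ValueError above U+10FFFF; not handled in this sketch.
            return chr(int(match.group(1), 16))
        return match.group(2)
    return _ESCAPE.sub(replace, written)


if __name__ == "__main__":
    for written in (r"B\&W\?", r"B\26 W\3F"):
        print(written, "->", unescape(written))
```

Both written forms decode to the same four characters, B, &, W and ?, which is the sense in which they "correspond to" that sequence without the sequence itself being an identifier.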

On the whole, the utility of the passage in its present state and 
position is dubious. If the Working Group feels a need to publish prose 
explaining the syntax, I would suggest that such prose reside in a 
non-normative appendix or even a separate Technical Report (such as a 
Note). Links from section 4 would be appropriate, of course. That said, 
the following revision attempts to make the best in-place change to the 
passage:

"In CSS 2.1, identifiers (including element names and class names) can 
contain only A–Z, a–z, 0–9, the hyphen-minus (U+002D), the underscore 
("_", U+005F) and characters U+00A1–U+FFFFFD in the ISO 10646 standard 
[ISO10646]; they cannot start with a hyphen-minus or a digit 0–9. 
Escapes, whether in numeric or direct form, may appear anywhere in an 
identifier, including at the start. So the identifier corresponding to 
"B&W?" may be written as "B\&W\?" or "B\26 W\3F".

"Note: The hyphen-minus (U+002D) is sometimes called a hyphen, a minus 
sign, or a dash. In the limited character set of ASCII, it had to fill 
all these roles. With the creation of the Unicode Standard [UNICODE] 
and ISO 10646 [ISO10646], there are now unambiguous coded characters 
for hyphen (U+2010), minus sign (U+2212) and various dashes including 
en dash (U+2013) and em dash (U+2014). The hyphen-minus cannot start an 
identifier, but the unambiguous characters can."


[D]
Bert Bos; and others.
"Characters and case", section 4.1.3 of the CSS 2.1 specification.
15 September 2003.
W3C Last Call Working Draft (work in progress).
<http://www.w3.org/TR/2003/WD-CSS21-20030915/syndata.html#q6>.


Received on Friday, 28 November 2003 18:40:44 UTC