- From: Etan Wexler <ewexler@stickdog.com>
- Date: Fri, 28 Nov 2003 15:32:31 -0800
- To: www-style@w3.org, Bert Bos <bert@w3.org>, Tantek Çelik <tantek@cs.stanford.edu>, Ian Hickson <ian@hixie.ch>, Håkon Wium Lie <howcome@opera.com>
Section 4.1.3 of the CSS 2.1 specification [D]:
"In CSS 2.1, identifiers (including element names, classes, and IDs in
selectors) can contain only the characters [A-Za-z0-9] and ISO 10646
characters 161 and higher, plus the hyphen (-) and the underscore (_);
they cannot start with a hyphen or a digit. They can also contain
escaped characters and any ISO 10646 character as a numeric code (see
next item). For instance, the identifier "B&W?" may be written as
"B\&W\?" or "B\26 W\3F".
"Note that Unicode is code-by-code equivalent to ISO 10646 (see
[UNICODE] and [ISO10646])."
I assume, because a comprehensive formal grammar is available nearby,
that this passage is primarily for non-technical readers. If so,
there are a couple of problems. The construct "[A-Za-z0-9]" will look
odd even if understood. The mention of "ISO 10646" is likely to mean
nothing to many non-technical readers.
ID selectors are HASH tokens. HASH tokens may have an unescaped digit
immediately following the number sign, as in {#1name}. If the prose is
not meant to constrain the syntax, the prose requires a change. If the
prose is meant to constrain the syntax beyond what the grammar allows,
I have to ask: what special meaning would the Working Group (now or in
the future) like to assign to HASH tokens that start with digits and
that are in selectors? If no distinct meaning is intended, there is no
point in constraining the syntax.
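
The mismatch between prose and grammar can be made concrete with a rough sketch. The patterns below are a simplified paraphrase of the grammar's "ident" and "name" macros, not a copy of them: escapes are omitted, the non-ASCII range is assumed to cover every code point from U+0080 up, and the helper names `is_ident` and `is_hash` are made up for illustration.

```python
import re

# Simplified paraphrase of the lexical macros (assumption: escapes are
# left out and "nonascii" is widened to every code point >= U+0080).
NMSTART = r"[A-Za-z_\u0080-\U0010FFFF]"
NMCHAR = r"[A-Za-z0-9_\-\u0080-\U0010FFFF]"

IDENT = re.compile(rf"{NMSTART}{NMCHAR}*")  # ident: nmstart nmchar*
NAME = re.compile(rf"{NMCHAR}+")            # name:  nmchar+


def is_ident(text):
    """True if text matches the simplified 'ident' pattern."""
    return IDENT.fullmatch(text) is not None


def is_hash(text):
    """True if text is '#' followed by the simplified 'name' pattern."""
    return text.startswith("#") and NAME.fullmatch(text[1:]) is not None


print(is_hash("#1name"))   # the grammar accepts this as a HASH token
print(is_ident("1name"))   # but "1name" is not an identifier
```

Under these assumptions "#1name" is a well-formed HASH token even though "1name" fails the identifier rule, which is the gap described above.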
Referring to characters by decimal numbers is odd, given the prevalence
and official use of hexadecimal notation. If intended as a sop to
non-technical readers, I think that the effort is mostly wasted. When
trying to discover what "characters 161 and higher" means in practical
terms, non-technical readers will probably have to translate to
hexadecimal numbers as they sift through, for example, the Unicode
Consortium's documents.
The exclusion of the C1 control characters is not reflected in the
grammar. Which is authoritative: prose or grammar? How do existing
implementations handle the presence of C1 control characters?
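
To see exactly which characters fall between the two definitions, here is a small sketch. It assumes that the grammar's non-ASCII class begins at octal 200 (decimal 128); the prose threshold of 161 is taken from the quoted passage. The constant names are mine.

```python
import unicodedata

PROSE_MIN = 161        # "characters 161 and higher", from the quoted prose
GRAMMAR_MIN = 0o200    # assumed lower bound of the grammar's non-ASCII class

# Code points the grammar would admit but the prose excludes.
gap = [chr(cp) for cp in range(GRAMMAR_MIN, PROSE_MIN)]

# C1 control characters have the general category "Cc".
c1_controls = [ch for ch in gap if unicodedata.category(ch) == "Cc"]
others = [ch for ch in gap if unicodedata.category(ch) != "Cc"]

print(f"prose threshold: U+{PROSE_MIN:04X}")
print(f"gap: U+{ord(gap[0]):04X} to U+{ord(gap[-1]):04X}, {len(gap)} code points")
print(f"C1 controls in gap: {len(c1_controls)}")
print("other characters in gap:", [f"U+{ord(ch):04X}" for ch in others])
```

Under that assumption the gap holds the 32 C1 controls (U+0080 to U+009F) plus U+00A0, the no-break space. The sketch also shows the decimal-to-hexadecimal translation that readers would otherwise do by hand: 161 is U+00A1.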
The character called a hyphen in the passage is actually a
hyphen-minus, U+002D. The true and unambiguous hyphen, U+2010, is
distinct in the Universal Character Set. I mean, the One True Hyphen,
praise be unto it.
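
The distinction is easy to check against the character database that ships with Python. This is only an illustrative lookup; the `describe` helper is a made-up name.

```python
import unicodedata

HYPHEN_MINUS = "\u002d"   # the ASCII character the passage calls a hyphen
TRUE_HYPHEN = "\u2010"    # the unambiguous hyphen


def describe(ch):
    """Return the code point and official character name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe(HYPHEN_MINUS))
print(describe(TRUE_HYPHEN))
```

Because U+2010 lies above code point 161, the quoted prose permits it anywhere in an identifier, including at the start.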
There are many digits which may start identifiers. Examples include
Thai, Arabic, Devanagari... The only digits excluded are the ones
present in ASCII, 0–9 (U+0030–U+0039).
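
A quick check of three such digits, as a sketch. The function `may_start_identifier` is a hypothetical helper that encodes only the rule in the quoted prose (raw characters, no escapes); the sample characters are my own choice.

```python
import unicodedata

# The digit "one" in three scripts (sample characters chosen for illustration).
SAMPLES = {
    "Thai": "\u0e51",
    "Arabic-Indic": "\u0661",
    "Devanagari": "\u0967",
}


def may_start_identifier(ch):
    """Start rule from the quoted prose: an ASCII letter or underscore,
    or any character numbered 161 and higher."""
    if ch.isascii():
        return ch.isalpha() or ch == "_"
    return ord(ch) >= 161


for script, ch in SAMPLES.items():
    # Category "Nd" marks a decimal digit in the Unicode character database.
    print(script, f"U+{ord(ch):04X}", unicodedata.category(ch),
          may_start_identifier(ch))

print("ASCII 1:", may_start_identifier("1"))
```

Each sample is a decimal digit as far as Unicode is concerned, yet each satisfies the start rule; only the ASCII digit is rejected.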
That identifiers can contain "any ISO 10646 character as a numeric
code" is true only because ISO 10646 has not yet assigned characters
beyond U+FFFFFF. From its inception, ISO 10646 has had a range of
codepoints up to U+7FFFFFFF. Current CSS, with a limit of six
hexadecimal digits in numeric escapes, can represent only codepoints
U+0000–U+FFFFFF. It would be best to specify the range as such or to
say that CSS can represent any Unicode character (which would be
U+0000–U+10FFFF).
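
The arithmetic behind these three ranges, as a short sketch. The constant and function names are made up for illustration; the limits themselves are the ones stated above.

```python
MAX_ESCAPE_DIGITS = 6                             # hex digits allowed in a numeric escape
MAX_ESCAPABLE = int("F" * MAX_ESCAPE_DIGITS, 16)  # largest value six hex digits can hold
UNICODE_MAX = 0x10FFFF                            # last Unicode code point
ISO_10646_MAX = 0x7FFFFFFF                        # original ISO 10646 upper bound


def escape_can_represent(code_point):
    """True if a six-digit numeric escape can express this code point."""
    return 0 <= code_point <= MAX_ESCAPABLE


print(f"largest escapable value: U+{MAX_ESCAPABLE:06X}")
print("covers all of Unicode:", escape_can_represent(UNICODE_MAX))
print("covers all of ISO 10646:", escape_can_represent(ISO_10646_MAX))
```

Six hexadecimal digits comfortably cover the Unicode range but fall well short of the full 31-bit space.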
"B&W?" is not an identifier. It is a sequence of characters that an
identifier using proper mechanisms can represent.
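
The relation between the written identifier and the character sequence it stands for can be sketched as a small decoder. This is a simplified reading of the escape rules, not the specification's algorithm: a backslash followed by one to six hexadecimal digits (and one optional whitespace character) yields that code point, and a backslash followed by any other non-newline character yields that character. The name `css_unescape` is made up.

```python
import re

# Numeric form: backslash, 1-6 hex digits, one optional whitespace character.
# Direct form: backslash, then any character that is neither a hex digit
# nor a newline.
_ESCAPE = re.compile(
    r"\\(?:([0-9A-Fa-f]{1,6})(?:\r\n|[ \t\r\n\f])?|([^\r\n\f0-9A-Fa-f]))"
)


def css_unescape(text):
    """Decode escapes in an identifier as written in a style sheet.

    Note: chr() raises ValueError for values above U+10FFFF, so escapes
    beyond the Unicode range are not handled by this sketch.
    """
    def replace(match):
        if match.group(1) is not None:
            return chr(int(match.group(1), 16))
        return match.group(2)

    return _ESCAPE.sub(replace, text)


print(css_unescape(r"B\&W\?"))
print(css_unescape(r"B\26 W\3F"))
```

Both written forms decode to the same four characters, which is the sense in which they "represent" the sequence.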
On the whole, the utility of the passage in its present state and
position is dubious. If the Working Group feels a need to publish prose
explaining the syntax, I would suggest that such prose reside in a
non-normative appendix or even a separate Technical Report (such as a
Note). Links from section 4 would be appropriate, of course. That said,
the following revision attempts to make the best in-place change to the
passage:
"In CSS 2.1, identifiers (including element names and class names) can
contain only A–Z, a–z, 0–9, the hyphen-minus (U+002D), the underscore
("_", U+005F) and characters U+00A1–U+FFFFFD in the ISO 10646 standard
[ISO10646]; they cannot start with a hyphen-minus or a digit 0–9.
Escapes, whether in numeric or direct form, may appear anywhere in an
identifier, including at the start. So the identifier corresponding to
"B&W?" may be written as "B\&W\?" or "B\26 W\3F".
"Note: The hyphen-minus (U+002D) is sometimes called a hyphen, a minus
sign, or a dash. In the limited character set of ASCII, it had to fill
all these roles. With the creation of the Unicode Standard [UNICODE]
and ISO 10646 [ISO10646], there are now unambiguous coded characters
for hyphen (U+2010), minus sign (U+2212) and various dashes including
en dash (U+2013) and em dash (U+2014). The hyphen-minus cannot start an
identifier, but the unambiguous characters can."
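
For anyone who wants to test the proposed wording against sample strings, here is a rough checker. It covers only unescaped characters, and the function name is made up. Python strings cannot hold code points above U+10FFFF, so the upper bound of U+FFFFFD is never actually reached.

```python
def is_identifier_proposed(text):
    """Check unescaped text against the proposed wording (escapes not handled)."""
    if not text:
        return False

    def allowed(ch):
        if ch.isascii():
            # A-Z, a-z, 0-9, hyphen-minus, underscore
            return ch.isalnum() or ch in "-_"
        return 0xA1 <= ord(ch) <= 0xFFFFFD

    first = text[0]
    # Cannot start with a hyphen-minus or an ASCII digit 0-9.
    if first == "-" or (first.isascii() and first.isdigit()):
        return False
    return all(allowed(ch) for ch in text)


print(is_identifier_proposed("-dash"))        # hyphen-minus at the start
print(is_identifier_proposed("\u2010dash"))   # U+2010 HYPHEN at the start
print(is_identifier_proposed("1a"))           # ASCII digit at the start
print(is_identifier_proposed("\u0e51a"))      # Thai digit at the start
```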
[D]
Bert Bos; and others.
"Characters and case", section 4.1.3 of the CSS 2.1 specification.
15 September 2003.
W3C Last Call Working Draft (work in progress).
<http://www.w3.org/TR/2003/WD-CSS21-20030915/syndata.html#q6>.
Received on Friday, 28 November 2003 18:40:44 UTC