Re: UTF-8 signature / BOM in CSS

François Yergeau wrote to <mailto:w3c-i18n-ig@w3.org> on 7 December 
2003 in "Re: UTF-8 signature / BOM in CSS" 
(<mid:3FD3AE62.8040201@yergeau.com>):

> Etan Wexler wrote:
>> Is the BOM to be considered an identifier character? That's possible.
>
> Do you mean part of the class of characters acceptable in identifiers?

Yes. The codepoint U+FEFF is currently allowed in and at the start of 
identifiers in the CSS 2 Recommendation, the CSS 2.1 Working Draft, and 
the CSS3 syntax module Working Draft. The last of these has a token type 
"BOM", but since the "BOM" production comes after the "IDENT" 
production, a U+FEFF codepoint would always end up as an "IDENT" or 
part of an "IDENT".
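
To make the ordering problem concrete, here is a rough sketch of a 
lex-style tokenizer (longest match wins, ties go to the production 
listed first). The productions are a simplified, hypothetical subset 
that I made up for illustration, not the grammar from any of the 
drafts; the non-ASCII range is the one I describe in this message.

```python
import re

# Hypothetical, heavily simplified productions (not the real CSS grammar).
# Non-ASCII identifier characters: U+00A1 through U+10FFFF.
NONASCII = "\u00A1-\U0010FFFF"

# Order matters: IDENT is listed before BOM, as in the draft discussed.
PRODUCTIONS = [
    ("IDENT", re.compile(f"[_a-zA-Z{NONASCII}][_a-zA-Z0-9{NONASCII}-]*")),
    ("BOM", re.compile("\uFEFF")),
    ("S", re.compile(r"[ \t\r\n\f]+")),
    ("DELIM", re.compile(r".", re.DOTALL)),  # catch-all, one character
]


def tokenize(text):
    """Return (token_type, lexeme) pairs using longest-match, first-rule-wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in PRODUCTIONS:
            m = pattern.match(text, pos)
            # Strict '>' means a later production never beats an earlier
            # one that matched the same length.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens
```

With these assumed productions, a lone U+FEFF comes out as an "IDENT", 
and a U+FEFF followed by "body" comes out as a single five-codepoint 
"IDENT"; the "BOM" production is never selected.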

> That looks like a bad idea.  In retrospect, it was a mistake in 
> Unicode to designate for the BOM function a code point that is also a 
> legitimate character (the zero-width non-breaking space, ZWNBSP).

I always thought so, but assumed that I was too dull to understand the 
expert opinion.

> Unicode has tried to minimize the damage by allocating another 
> character (U+2060 word joiner) for the legitimate uses of ZWNBSP and 
> to deprecate the latter for any use except as a signature.

Maybe I'm not so dull, after all.

> It therefore sounds unadvisable to admit it as an identifier 
> character, a somewhat distinguished class.

Identifier characters aren't very distinguished in CSS. Besides 
selected characters from the ASCII range, any character in the range 
U+00A1 to U+10FFFF is legitimate. In CSS 2 and the CSS 2.1 drafts, 
there isn't even an explicit exclusion of noncharacters. A reader could 
infer the exclusion, as I do, since the CSS specifications' prose talks 
of dealing with characters.
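
As a quick illustration of how undistinguished the class is, the 
sketch below checks code points against the range I gave above and 
against Unicode's noncharacters (U+FDD0 to U+FDEF, plus the last two 
code points of each plane). The function names are mine, and the range 
test reflects only my reading of the specifications.

```python
def in_nonascii_ident_range(cp):
    """True if cp lies in U+00A1..U+10FFFF (the range as I read it)."""
    return 0x00A1 <= cp <= 0x10FFFF


def is_noncharacter(cp):
    """True for Unicode noncharacters: U+FDD0..U+FDEF and U+nFFFE/U+nFFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


# Every noncharacter, and U+FEFF itself, passes the bare range test.
noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
all_in_range = all(in_nonascii_ident_range(cp) for cp in noncharacters)
```

A bare range test of this kind admits every noncharacter along with 
U+FEFF, which is why the exclusion has to be read into the prose.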

> The BOM is a non-breaking space, quite the opposite of a separator.
>
> The BOM is really in a class in itself, my proposal was to name it 
> explicitly in the grammar, appearing only at the very start of the 
> stylesheet.

What happens when a tokenizer finds a U+FEFF somewhere else in a style 
sheet? The codepoint may be invalid there, granted, but the CSS Working 
Group is heading toward specifying error handling for every scenario. 
If we accept Chris Lilley's assertion that U+FEFF is 
not a character, stripping occurrences of U+FEFF before tokenization 
seems very reasonable. If U+FEFF is a character (and I don't care to 
enter that theological debate), stripping it may still be the sensible 
option. What's the Yergeau recommendation? The Davis recommendation?
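
For the sake of discussion, the two candidate behaviors might look 
like this as a preprocessing step on the decoded text. Neither is 
taken from a specification; the function names are hypothetical.

```python
BOM = "\uFEFF"


def strip_all_bom(text):
    """Remove every U+FEFF, wherever it occurs, before tokenization."""
    return text.replace(BOM, "")


def strip_leading_bom(text):
    """Remove a single U+FEFF only at the very start of the style sheet."""
    return text[1:] if text.startswith(BOM) else text
```

The first variant makes a stray U+FEFF invisible to the tokenizer 
everywhere; the second matches the signature-only proposal and leaves 
any later occurrence for the tokenizer (and its error handling) to 
deal with.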

-- 
Etan Wexler.

Received on Sunday, 7 December 2003 20:33:54 UTC