- From: Etan Wexler <ewexler@stickdog.com>
- Date: Sun, 7 Dec 2003 17:34:10 -0800
- To: François Yergeau <francois@yergeau.com>, Mark Davis <mark.davis@jtcsv.com>, www-style@w3.org
- Cc: w3c-i18n-ig@w3.org
François Yergeau wrote to <mailto:w3c-i18n-ig@w3.org> on 7 December 2003 in "Re: UTF-8 signature / BOM in CSS" (<mid:3FD3AE62.8040201@yergeau.com>):

> Etan Wexler wrote:
>> Is the BOM to be considered an identifier character? That's possible.
>
> Do you mean part of the class of characters acceptable in identifiers?

Yes. The codepoint U+FEFF is currently allowed in and at the start of identifiers in the CSS 2 Recommendation, the CSS 2.1 Working Draft, and the CSS3 syntax module Working Draft. The last of these has a token type "BOM", but since the "BOM" production comes after the "IDENT" production, a U+FEFF codepoint would always end up as an "IDENT" or part of an "IDENT".

> That looks like a bad idea. In retrospect, it was a mistake in
> Unicode to designate for the BOM function a code point that is also a
> legitimate character (the zero-width non-breaking space, ZWNBSP).

I always thought so, but assumed that I was too dull to understand the expert opinion.

> Unicode has tried to minimize the damage by allocating another
> character (U+2060 word joiner) for the legitimate uses of ZWNBSP and
> to deprecate the latter for any use except as a signature.

Maybe I'm not so dull, after all.

> It therefore sounds unadvisable to admit it as an identifier
> character, a somewhat distinguished class.

Identifier characters aren't very distinguished in CSS. Besides selected characters from the ASCII range, any character in the range U+00A1 to U+10FFFF is legitimate. In CSS 2 and the CSS 2.1 drafts, there isn't even an explicit exclusion of noncharacters. A reader could infer the exclusion, as I do, since the CSS specifications' prose talks of dealing with characters.

> The BOM is a non-breaking space, quite the opposite of a separator.
>
> The BOM is really in a class in itself, my proposal was to name it
> explicitly in the grammar, appearing only at the very start of the
> stylesheet.

What happens when a tokenizer finds a U+FEFF somewhere else in a style sheet? The codepoint may be invalid there, granted, but the CSS Working Group is heading toward specifying error handling for every scenario. If we accept Chris Lilley's assertion that U+FEFF is not a character, stripping occurrences of U+FEFF before tokenization seems very reasonable. If U+FEFF is a character (and I don't care to enter that theological debate), stripping it may still be the sensible option.

What's the Yergeau recommendation? The Davis recommendation?

-- Etan Wexler.
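
To make the tokenization point concrete, here is a minimal Python sketch. It is not taken from any CSS specification: the production list, the `tokenize` and `strip_bom` names, and the lex-style matching rule (longest match wins, ties go to the production listed first) are all assumptions made for illustration. Escapes are omitted, and the identifier classes follow the simplified description above (selected ASCII characters plus U+00A1 to U+10FFFF).

```python
import re

# Simplified identifier classes (assumption: no escapes, no leading hyphen).
NMSTART = r"[_a-zA-Z\u00A1-\U0010FFFF]"
NMCHAR = r"[_a-zA-Z0-9\-\u00A1-\U0010FFFF]"

# Hypothetical production list. "BOM" comes after "IDENT", mirroring the
# ordering described for the CSS3 syntax module draft.
PRODUCTIONS = [
    ("IDENT", re.compile(NMSTART + NMCHAR + "*")),
    ("BOM", re.compile("\uFEFF")),
    ("S", re.compile(r"[ \t\r\n\f]+")),
]


def tokenize(text):
    """Lex-style scan: longest match wins; ties go to the earlier production."""
    tokens = []
    pos = 0
    while pos < len(text):
        best = None
        for name, pattern in PRODUCTIONS:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier production when lengths are equal.
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            # Anything unmatched becomes a one-character DELIM token.
            tokens.append(("DELIM", text[pos]))
            pos += 1
        else:
            tokens.append((best[0], best[1].group()))
            pos = best[1].end()
    return tokens


def strip_bom(text):
    """Remove every U+FEFF before tokenization (the option floated above)."""
    return text.replace("\uFEFF", "")
```

Under these assumptions, `tokenize("\uFEFF")` never yields a "BOM" token, because "IDENT" matches the same single codepoint and is listed first. Running `tokenize(strip_bom(...))` instead removes the codepoint wherever it occurs, so an identifier interrupted by a stray U+FEFF is read as one ordinary identifier.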
Received on Sunday, 7 December 2003 20:33:54 UTC