- From: Etan Wexler <ewexler@stickdog.com>
- Date: Fri, 5 Dec 2003 13:30:37 -0800
- To: Tex Texin <tex@i18nguy.com>
- Cc: Richard Ishida <ishida@w3.org>, www-international@w3.org, w3c-css-wg@w3.org, w3c-i18n-ig@w3.org, www-style@w3.org
Tex Texin wrote to <mailto:www-international@w3.org>, <mailto:w3c-css-wg@w3.org>, <mailto:w3c-i18n-ig@w3.org>, and <mailto:www-style@w3.org> on 2 December 2003 in "Re: UTF-8 signature / BOM in CSS" (<mid:3FCD6609.7C5A8F4F@i18nguy.com>):

> I am not sure I would agree with stripping non-characters. I would
> rather reject documents with junk in them than silently clean them up.

I used to be of the junk-rejection mentality. Ian Hickson, time, and probably some brain-altering medication have convinced me of the case for parsing at all costs.

> In the case of the UTF-8 BOM, I would not object to simply stripping it,
> but it does seem odd to not make use of the information about the
> document's encoding and odder still to not use the information about
> endian-ness in a UTF-16 encoded document.

I assumed that the CSS engine would make use of out-of-band information to indicate the detected encoding scheme. Or the CSS engine would internally convert style sheets' encodings to a single chosen encoding (say, UTF-16BE). Or the CSS engine would parse bytes into encoding-independent character objects. The CSS engine would then pass these character objects to the tokenizer, with the original encoding scheme becoming irrelevant to understanding the CSS.

> Also stripping it in the case
> of UTF-16 would eliminate useful information from a CSS document.

I intend to strip the BOM only for internal purposes. That is, the CSS engine strips the BOM from the style sheet in order to normalize the input fed to the tokenizer. As stated, the encoding scheme should either be flagged out of band or be made irrelevant by the particulars of the implementation.

Interfacing with the outside world has separate concerns, one of which is to preserve any encoding signature. Again, the BOM is not necessary internally. A CSS editor could note the encoding scheme of a style sheet, work with the style sheet's constructs, and add appropriate encoding signatures upon serialization.
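The internal normalization described above could be sketched roughly as follows. This is an illustrative Python sketch, not any particular CSS engine's code: the names `normalize_style_sheet` and `serialize_style_sheet`, the UTF-8 fallback, and the `with_signature` flag are all assumptions made for the example. It detects the signature, records the encoding scheme out of band, hands BOM-free text to the tokenizer, and re-attaches the signature only on serialization.

```python
import codecs

# Longest signatures first: the UTF-32LE BOM begins with the UTF-16LE BOM.
_SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def normalize_style_sheet(raw: bytes, fallback: str = "utf-8"):
    """Return (text_without_bom, encoding_scheme).

    The text goes to the tokenizer; the encoding scheme is kept
    out of band. The fallback encoding is an assumption of this sketch.
    """
    for bom, encoding in _SIGNATURES:
        if raw.startswith(bom):
            # Strip the signature bytes, then decode the remainder.
            return raw[len(bom):].decode(encoding), encoding
    return raw.decode(fallback), fallback


def serialize_style_sheet(text: str, encoding: str,
                          with_signature: bool = True) -> bytes:
    """Encode the style sheet, optionally re-attaching the signature."""
    signature = "\ufeff".encode(encoding) if with_signature else b""
    return signature + text.encode(encoding)
```

An editor built along these lines would call `normalize_style_sheet` on load, remember the returned encoding scheme, and pass it back to `serialize_style_sheet` on save, so the signature survives the round trip even though the tokenizer never sees it.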
> To answer your question about other BOMs, they are all based on U+FEFF,
> but they exist for UTF-16, UTF-32, and SCSU (Unicode compression).
>
> I have a list with more detail here:
> http://www.i18nguy.com/unicode/c-unicode.html#BOM
>
> and the Unicode Consortium has a FAQ on UTF-8 and the BOM at:
> http://www.unicode.org/faq/utf_bom.html

They're not just based on U+FEFF, they are U+FEFF. There are various byte sequences, yes, but each encodes the same character.

-- 
Etan Wexler.
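The point that every signature is the same character can be checked with a short Python sketch. It uses only the BOM constants in the standard `codecs` module; SCSU is left out because the standard library has no codec for it, and the variable names are just illustrative.

```python
import codecs

# Byte-level signatures for the encoding schemes Python knows about.
SIGNATURES = {
    "utf-8": codecs.BOM_UTF8,
    "utf-16-be": codecs.BOM_UTF16_BE,
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-32-be": codecs.BOM_UTF32_BE,
    "utf-32-le": codecs.BOM_UTF32_LE,
}

# Decode each byte sequence with its own encoding scheme.
decoded = {name: raw.decode(name) for name, raw in SIGNATURES.items()}

for name, raw in SIGNATURES.items():
    # Different bytes on the left, the same code point on the right.
    print(f"{name:10} {raw.hex()} -> U+{ord(decoded[name]):04X}")
```

The byte sequences differ per encoding scheme, but each decodes to the single character U+FEFF.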
Received on Friday, 5 December 2003 16:30:08 UTC