Re: UTF-8 signature / BOM in CSS from Etan Wexler on 2003-12-05 (www-style@w3.org from December 2003)

From: Etan Wexler <ewexler@stickdog.com>
Date: Fri, 5 Dec 2003 13:30:37 -0800
To: Tex Texin <tex@i18nguy.com>
Cc: Richard Ishida <ishida@w3.org>, www-international@w3.org, w3c-css-wg@w3.org, w3c-i18n-ig@w3.org, www-style@w3.org
Message-Id: <44948884-276A-11D8-9E7C-000502CB1B77@stickdog.com>

Tex Texin wrote to>, <mailto:www-international@w3.org>, 
<mailto:w3c-css-wg@w3.org>, <mailto:w3c-i18n-ig@w3.org>, and 
<mailto:www-style@w3.org> on 2 December 2003 in "Re: UTF-8 signature / 
BOM in CSS" (<mid:3FCD6609.7C5A8F4F@i18nguy.com>):

> I am not sure I would agree with stripping non-characters. I would
> rather reject documents with junk in them than silently clean them up.

I used to be of the junk-rejection mentality. Ian Hickson, time, and 
probably some brain-altering medication have convinced me of the case 
for parsing at all costs.

> In the case of the UTF-8 BOM, I would not object to simply stripping
> it, but it does seem odd to not make use of the information about the
> document's encoding and odder still to not use the information about
> endian-ness in a UTF-16 encoded document.

I assumed that the CSS engine would use out-of-band information to 
record the detected encoding scheme. Or the CSS engine would 
internally convert style sheets' encodings to a single chosen encoding 
(say, UTF-16BE). Or the CSS engine would parse bytes into 
encoding-independent character objects. The CSS engine would then pass 
these character objects to the tokenizer, with the original encoding 
scheme becoming irrelevant to understanding the CSS.
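
To make the third option concrete, here is a minimal sketch in Python. 
It is an illustration under my own assumptions, not a description of any 
shipping CSS engine: the name decode_style_sheet is hypothetical, and 
the UTF-8 fallback stands in for whatever a real engine would learn from 
HTTP headers or an @charset rule.

```python
import codecs

# Longest signatures first: the UTF-32LE signature begins with the same
# two bytes as the UTF-16LE signature, so order matters.
_SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def decode_style_sheet(data: bytes, fallback: str = "utf-8"):
    """Return (text, encoding) for a style sheet given as raw bytes.

    The returned text never starts with the signature; the detected
    encoding scheme is reported out of band as the second item.
    The fallback is an assumption made for this sketch only.
    """
    for bom, encoding in _SIGNATURES:
        if data.startswith(bom):
            # Strip the signature bytes, then decode the remainder.
            return data[len(bom):].decode(encoding), encoding
    return data.decode(fallback), fallback
```

The tokenizer would then see only the returned string, while the second 
item of the pair carries the encoding scheme for anyone who still needs 
it.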

> Also stripping it in the case
> of UTF-16 would eliminate useful information from a CSS document.

I intend to strip the BOM only for internal purposes. That is, the CSS 
engine strips the BOM from the style sheet in order to normalize the 
input fed to the tokenizer. As stated, the encoding scheme should 
either be flagged out of band or be made irrelevant by the particulars 
of the implementation. Interfacing with the outside world has separate 
concerns, one of which is to preserve any encoding signature. Again, 
the BOM is not necessary internally. A CSS editor could note the 
encoding scheme of a style sheet, work with the style sheet's 
constructs, and add appropriate encoding signatures upon serialization.
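
The serialization step for such an editor might look roughly like the 
following Python sketch. The function name serialize_style_sheet and the 
with_signature flag are hypothetical, and the table covers only the 
UTF-8 and UTF-16 schemes for brevity.

```python
import codecs

# Signature bytes for each encoding scheme that this sketch supports.
_BOM_FOR = {
    "utf-8": codecs.BOM_UTF8,
    "utf-16-be": codecs.BOM_UTF16_BE,
    "utf-16-le": codecs.BOM_UTF16_LE,
}


def serialize_style_sheet(text: str, encoding: str, with_signature: bool) -> bytes:
    """Encode the style sheet text, restoring the signature if one was noted."""
    body = text.encode(encoding)
    if with_signature:
        # Prepend the signature that belongs to the noted encoding scheme.
        return _BOM_FOR[encoding] + body
    return body
```

The editor would keep the encoding name and the signature flag alongside 
the parsed constructs and hand both back at save time.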

> To answer your question about other BOMs, they are all based on U+FEFF,
> but they exist for UTF-16, UTF-32, and SCSU (Unicode compression).
>
> I have a list with more detail here:
> http://www.i18nguy.com/unicode/c-unicode.html#BOM
>
> and the Unicode Consortium has a FAQ on UTF-8 and the BOM at:
> http://www.unicode.org/faq/utf_bom.html

They're not just based on U+FEFF; they are U+FEFF. The byte sequences 
vary, yes, but each encodes the same character.
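
A quick way to check this is to decode each signature with its own 
encoding scheme, as in this small Python sketch. SCSU is left out only 
because Python's standard library has no codec for it; the names 
signatures and decoded are mine.

```python
import codecs

# Signature byte sequences, keyed by the codec that decodes them.
signatures = {
    "utf-8": codecs.BOM_UTF8,
    "utf-16-be": codecs.BOM_UTF16_BE,
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-32-be": codecs.BOM_UTF32_BE,
    "utf-32-le": codecs.BOM_UTF32_LE,
}

# The endian-specific codecs do not strip a leading signature, so each
# byte sequence decodes to the single character that it encodes.
decoded = {name: raw.decode(name) for name, raw in signatures.items()}

for name, raw in signatures.items():
    print(f"{name:10} {raw.hex()} -> U+{ord(decoded[name]):04X}")
```

Each line of output shows different bytes on the left and the same code 
point on the right.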

-- 
Etan Wexler.

Received on Friday, 5 December 2003 16:30:08 UTC