Re: UTF-8 signature / BOM in CSS

From: Etan Wexler <ewexler@stickdog.com>
Date: Sat, 6 Dec 2003 19:08:13 -0800
To: Chris Lilley <chris@w3.org>, www-international@w3.org, w3c-css-wg@w3.org, www-style@w3.org
Message-Id: <9884A7A2-2862-11D8-9BCD-000502CB1B77@stickdog.com>

Chris Lilley wrote to <mailto:www-international@w3.org>, 
<mailto:w3c-css-wg@w3.org>, <mailto:w3c-i18n-ig@w3.org>, and 
<mailto:www-style@w3.org> on 6 December 2003 in "Re: UTF-8 signature / 
BOM in CSS" (<mid:862788409.20031206164822@w3.org>):

> EW> I assumed that the CSS engine would make use of out-of-band
> EW> information to indicate the detected encoding scheme.
>
> Please check the definition of that out of band information [in]
> particular what it says about when a BOM must be present.

Perhaps I was unclear. I did not mean that the CSS engine would 
propagate the "charset" parameter's value unmodified. What I had in 
mind is as follows.

The CSS engine retrieves a style sheet. It could be from HTTP, the 
local file system, FTP, SMTP + MIME, a database, or any source, really. 
The CSS engine detects an encoding scheme according to the prescribed 
or accepted best practice. Factors that could determine the detection 
include a "charset" parameter, a byte-order mark (U+FEFF), a database 
schema, a file name extension, and the native byte order of the local 
machine. Once the encoding scheme is detected, it is noted for further 
use. The encoding scheme will never be noted as UTF-16 or UTF-32.
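
Of the factors listed above, the byte-order mark is the one that can be 
checked mechanically. What follows is only a minimal sketch in Python, 
not a description of any real CSS engine. The names `sniff_bom` and 
`_BOMS` are hypothetical, and the sketch assumes that the byte sequence 
FF FE 00 00 should be read as a UTF-32LE signature rather than as a 
UTF-16LE signature followed by U+0000.

```python
import codecs

# Hypothetical table for illustration. Longer signatures come first,
# because the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]


def sniff_bom(data: bytes):
    """Return (scheme, bom_length), or (None, 0) when no BOM is present.

    The scheme returned is always a concrete one with explicit byte
    order; the bare labels "UTF-16" and "UTF-32" are never returned.
    """
    for bom, scheme in _BOMS:
        if data.startswith(bom):
            return scheme, len(bom)
    return None, 0
```

A caller would consult the other factors (a "charset" parameter, a file 
name extension, and so on) when `sniff_bom` finds nothing.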

There is no encoding scheme UTF-16. There is a "charset" value in the 
IANA registry called "UTF-16", but UTF-16 is an encoding form. Any 
serialized UTF-16 document is either big-endian or little-endian. 
Nevertheless, the "UTF-16" label is allowed and in use, so we resort to 
the BOM to disambiguate the label's meaning. The situation with UTF-32 
is analogous.
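
The disambiguation of the "UTF-16" label might be sketched as below. 
The function name `resolve_utf16_label` is hypothetical. The fallback to 
big-endian when no BOM is present is an assumption taken from the 
recommendation in RFC 2781; an engine could as well consult other 
factors, such as the native byte order of the local machine.

```python
def resolve_utf16_label(label: str, data: bytes) -> str:
    """Map a "charset" label to a concrete UTF-16 encoding scheme.

    Hypothetical helper: the result is always "UTF-16BE" or "UTF-16LE",
    never the ambiguous label "UTF-16".
    """
    name = label.strip().upper()
    if name in ("UTF-16BE", "UTF-16LE"):
        return name  # the label already states the byte order
    if name != "UTF-16":
        raise ValueError(f"not a UTF-16 label: {label!r}")
    # The label is ambiguous, so the BOM decides.
    if data.startswith(b"\xfe\xff"):
        return "UTF-16BE"
    if data.startswith(b"\xff\xfe"):
        return "UTF-16LE"
    # Assumption: with no BOM, fall back to big-endian (RFC 2781).
    return "UTF-16BE"
```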

We return to our CSS scenario. The encoding scheme has been detected 
and noted, including any significant endianness. No BOM is necessary 
for the tokenizer. The BOM is stripped from the internal representation 
of the style sheet. The remaining byte stream moves along to the 
tokenizer. The tokenizer consults the noted encoding scheme in order to 
properly interpret the bytes. Parsing proceeds on course.
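
As a rough illustration of this last stage, here is a sketch in Python. 
The names `prepare_for_tokenizer` and `_BOM_FOR_SCHEME` are 
hypothetical, and the sketch assumes that the noted scheme is one of 
the five concrete Unicode encoding schemes shown in the table.

```python
import codecs

# Hypothetical lookup from a noted encoding scheme to its BOM bytes.
_BOM_FOR_SCHEME = {
    "UTF-8": codecs.BOM_UTF8,
    "UTF-16BE": codecs.BOM_UTF16_BE,
    "UTF-16LE": codecs.BOM_UTF16_LE,
    "UTF-32BE": codecs.BOM_UTF32_BE,
    "UTF-32LE": codecs.BOM_UTF32_LE,
}


def prepare_for_tokenizer(data: bytes, scheme: str) -> str:
    """Strip a leading BOM, then decode the rest with the noted scheme."""
    bom = _BOM_FOR_SCHEME.get(scheme, b"")
    if bom and data.startswith(bom):
        # The BOM has done its job during detection; drop it.
        data = data[len(bom):]
    # The byte order comes from the noted scheme, not from the stream.
    return data.decode(scheme)
```

Stripping first matters in this sketch because Python's byte-order-specific 
codecs would otherwise hand U+FEFF to the tokenizer as an ordinary 
character.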

What did I miss or misunderstand?

-- 
Etan Wexler.

Received on Saturday, 6 December 2003 22:07:29 GMT
