Re: UTF-8 signature / BOM in CSS

One correction.

>There is no encoding scheme UTF-16.

The string "UTF-16" names both an encoding form and an encoding scheme. That is
somewhat unfortunate, but it is that way for historical reasons. As an encoding
form, it is independent of byte ordering. As an encoding scheme, it is defined
such that an (optional) BOM determines how the rest of the bytes are
interpreted: as pairs of bytes in either big-endian or little-endian order.
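
To make the form/scheme distinction concrete, here is a minimal sketch using
Python's standard codecs, where the labels "utf-16", "utf-16-be" and
"utf-16-le" behave like the three encoding schemes. The variable names are
only for illustration. Input without a BOM is deliberately left out, because
Python's "utf-16" decoder then assumes the platform's native byte order, which
is a property of that library and not something the example should rely on.

```python
import codecs

text = "hi"

# The same code units (the encoding form), serialized in both byte orders
# and each prefixed with the matching byte-order mark.
be = codecs.BOM_UTF16_BE + text.encode("utf-16-be")  # FE FF 00 68 00 69
le = codecs.BOM_UTF16_LE + text.encode("utf-16-le")  # FF FE 68 00 69 00

# Under the "utf-16" label the BOM selects the byte order and is consumed.
assert be.decode("utf-16") == "hi"
assert le.decode("utf-16") == "hi"

# Under an explicit-order label nothing is consumed, so a leading U+FEFF
# remains in the decoded text as an ordinary character.
assert le.decode("utf-16-le") == "\ufeffhi"
```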

Mark
__________________________________
http://www.macchiato.com
► शिष्यादिच्छेत्पराजयम् ◄

----- Original Message ----- 
From: "Etan Wexler" <ewexler@stickdog.com>
To: "Chris Lilley" <chris@w3.org>; <www-international@w3.org>;
<w3c-css-wg@w3.org>; <www-style@w3.org>
Sent: Sat, 2003 Dec 06 19:08
Subject: Re: UTF-8 signature / BOM in CSS



Chris Lilley wrote to <mailto:www-international@w3.org>,
<mailto:w3c-css-wg@w3.org>, <mailto:w3c-i18n-ig@w3.org>, and
<mailto:www-style@w3.org> on 6 December 2003 in "Re: UTF-8 signature /
BOM in CSS" (<mid:862788409.20031206164822@w3.org>):

> EW> I assumed that the CSS engine would make use of out-of-band
> information
> EW> to indicate the detected encoding scheme.
>
> Please check the definition of that out of band information [— in]
> particular what it says about when a BOM must be present.

Perhaps I was unclear. I did not mean that the CSS engine would
propagate the "charset" parameter's value unmodified. What I had in
mind is as follows.

The CSS engine retrieves a style sheet. It could be from HTTP, the
local file system, FTP, SMTP + MIME, a database, or any source, really.
The CSS engine detects an encoding scheme according to the prescribed
or accepted best practice. Factors that could determine the detection
include a "charset" parameter, a byte-order mark (U+FEFF), a database
schema, a file name extension, and the native byte order of the local
machine. Once the encoding scheme is detected, it is noted for further
use. The encoding scheme will never be noted as UTF-16 or UTF-32.

There is no encoding scheme UTF-16. There is a "charset" value in the
IANA registry called "UTF-16", but UTF-16 is an encoding form. Any
serialized UTF-16 document is either big-endian or little-endian.
Nevertheless, the "UTF-16" label is allowed and in use, so we resort to
the BOM to disambiguate the label's meaning. The situation with UTF-32
is analogous.

We return to our CSS scenario. The encoding scheme has been detected
and noted, including any significant endianness. No BOM is necessary
for the tokenizer. The BOM is stripped from the internal representation
of the style sheet. The remaining byte stream moves along to the
tokenizer. The tokenizer consults the noted encoding scheme in order to
properly interpret the bytes. Parsing proceeds on course.
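
The scenario above might be sketched roughly as follows. This is only an
illustration, not what any CSS engine actually does: the function names
detect_scheme and decode_style_sheet are hypothetical, letting a BOM take
precedence over the "charset" label is an assumption, and so is the UTF-8
fallback when neither is present. Resolving an unmarked "UTF-16" or "UTF-32"
label to big-endian follows the Unicode Standard's default for those schemes.

```python
import codecs

# Checked in this order because the UTF-32LE BOM (FF FE 00 00) begins with
# the UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def detect_scheme(data, charset=None):
    """Return (scheme, payload); the scheme is never bare UTF-16 or UTF-32."""
    for bom, scheme in _BOMS:
        if data.startswith(bom):
            # Note the scheme and strip the BOM from the byte stream.
            return scheme, data[len(bom):]
    # No BOM: fall back on the out-of-band label (UTF-8 default is an assumption).
    label = (charset or "utf-8").lower()
    if label in ("utf-16", "utf-32"):
        label += "-be"  # unmarked input is taken as big-endian
    return label, data

def decode_style_sheet(data, charset=None):
    """Hypothetical front end: detect, strip, then decode for the tokenizer."""
    scheme, payload = detect_scheme(data, charset)
    return scheme, payload.decode(scheme)
```

For example, decode_style_sheet(codecs.BOM_UTF16_LE + "a{}".encode("utf-16-le"),
"utf-16") returns ("utf-16-le", "a{}"): the noted scheme carries the byte
order, and the tokenizer never sees the BOM.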

What did I miss or misunderstand?

-- 
Etan Wexler.


Received on Sunday, 7 December 2003 15:39:43 UTC