Re: UTF-8 signature / BOM in CSS from François Yergeau on 2003-12-11 (www-international@w3.org from October to December 2003)

From: François Yergeau <francois@yergeau.com>
Date: Thu, 11 Dec 2003 10:40:34 -0500
To: Bert Bos <bert@w3.org>
Cc: www-international@w3.org, www-style@w3.org
Message-id: <3FD88FF2.2010203@yergeau.com>
Bert Bos wrote:
> I've written some new text for section 4.4 of CSS 2.1[1]. 
> [1] http://www.w3.org/TR/CSS21/syndata.html#q23
> ...
>        1. An HTTP "charset" parameter in a "Content-Type" field.
> 
>        2. The @charset at-rule.
> 
>        3. Mechanisms of the language of the referencing document
>           (e.g., in HTML, the "charset" attribute of the LINK
>           element).
> 
>   |    4. UA-dependent mechanisms (e.g., guessing based on the BOM)

That's not good: the BOM belongs in 2, along with @charset.  Both are of 
the same nature: in-band identification of the character encoding.  Both 
are equally valid ways to do this (but the BOM is limited to Unicode 
encodings).  Using the BOM to identify encoding is not a guess any more 
than using @charset is.  It should not be UA-dependent any more than 
@charset.
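
To make the proposed ordering concrete, here is a rough Python sketch. It is only an illustration, not text from any CSS draft: the names `in_band_encoding` and `choose_encoding`, their parameters, and the ISO-8859-1 fallback in step 4 are all assumptions, and the @charset parsing is deliberately simplistic (ASCII-compatible encodings only, UTF-32 signatures left out).

```python
import codecs
import re

# Signatures checked for step 2; UTF-32 is omitted to keep the sketch short.
_SIGNATURES = [
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]

# Simplified @charset matcher: only works for ASCII-compatible encodings.
_CHARSET_RULE = re.compile(rb'@charset "([A-Za-z0-9_.:-]+)";')


def in_band_encoding(data):
    """Step 2: a signature (BOM) or an @charset rule at the very start."""
    for signature, name in _SIGNATURES:
        if data.startswith(signature):
            return name
    match = _CHARSET_RULE.match(data)
    if match:
        return match.group(1).decode("ascii")
    return None


def choose_encoding(data, external=None, referencing=None):
    """Return the first encoding found, in the priority order discussed."""
    return (
        external                    # 1. external info (e.g. HTTP or MIME headers)
        or in_band_encoding(data)   # 2. in-band: BOM or @charset, same footing
        or referencing              # 3. referencing document (e.g. LINK charset)
        or "ISO-8859-1"             # 4. assumed UA-dependent fallback
    )
```

The point of the sketch is that the signature and @charset are consulted at the same step, and neither involves guessing.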

Oh, and in 1 it should be a little wider than just HTTP: there's also 
HTTPS, multipart mail with MIME headers, other similar things possibly 
now and almost certainly in the future.  I recently suggested using 
"external character encoding information (such as MIME or HTTP 
headers)", slightly adapted from the XML spec.
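
As a small illustration of why the wording need not mention HTTP at all: the charset parameter lives in the Content-Type value, which looks the same whether it arrived over HTTP, HTTPS or in a MIME part of a mail message. The sketch below uses Python's standard `email.message.Message` class to read it; the helper name `external_charset` is made up for this example.

```python
from email.message import Message


def external_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    header = Message()
    header["Content-Type"] = content_type
    # get_content_charset() lower-cases the value and returns None if absent.
    return header.get_content_charset()
```

For instance, `external_charset("text/css; charset=UTF-8")` gives `"utf-8"`, regardless of the transport that delivered the header.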

>     At most one @charset rule may appear in an external style sheet
>   | and it must appear at the very start of the document, not preceded
>   | by any characters, except possibly a "BOM" (see below). Any other
>   | @charset rules must be ignored by the UA.

That's good.  I guess you did not like my suggestion of integrating the 
BOM in the grammar instead of discussing it in the prose?

>     This specification does not mandate which character encodings a
>     user agent must support.

It should (UTF-8, UTF-16).  Perhaps CSS3 will?

>   | If an external style sheet has U+FEFF ("zero width non-breaking
>   | space") as the first character (i.e., even before any @charset
>   | rule), this character is interpreted as a so-called "Byte Order
>   | Mark" (BOM), as follows:
>   |
>   |   - If the style sheet is encoded as "UTF-16" [RFC2781] or
>   |     "UTF-32" [UNICODE], the BOM determines the byte order
>   |     ("big-endian" or "little-endian") as explained in the cited
>   |     RFC. If the style sheet is encoded as anything else, the
>   |     U+FEFF character is ignored.

This is the wrong way around, IMHO.  If a UTF-16(BE|LE) BOM is found, 
then the encoding is determined to be UTF-16(BE|LE).  Same for UTF-32 
and UTF-8.  U+FEFF is the UCS signature and has been since the first 
edition of ISO 10646 in 1993.  Its function is to indicate that the text 
is in Unicode and to tell in which particular encoding scheme of 
Unicode, including byte order in the case of the multibyte encodings. 
The above makes too much of the BOM moniker, which is only a moniker; 
it's a signature, even in UTF-8 where the byte order aspect is irrelevant.
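
Read this way, the signature selects the encoding rather than the other way around. A minimal Python sketch of that reading follows; the names `SIGNATURES`, `sniff_signature` and `decode_with_signature` and the UTF-8 default are assumptions for illustration, while the byte sequences come from the standard `codecs` module.

```python
import codecs

# Longest signatures first: the UTF-32LE signature (FF FE 00 00) begins
# with the UTF-16LE one (FF FE). A UTF-16LE text whose first character is
# U+0000 would be misread as UTF-32LE; that is unlikely in a style sheet.
SIGNATURES = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]


def sniff_signature(data):
    """Return (encoding, signature length), or (None, 0) if none is found."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name, len(signature)
    return None, 0


def decode_with_signature(data, default="UTF-8"):
    """Decode data using its signature, dropping the signature bytes."""
    name, skip = sniff_signature(data)
    return data[skip:].decode(name or default)
```

Note that UTF-8 appears in the table on the same footing as the others, even though it has no byte order to mark.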

>   |   - An external style sheet should start with a BOM if it is
>   |     encoded as "UTF-16" or "UTF-32" and should not have a BOM in
>   |     any other encodings.

Add UTF-8.  The UTF-8 signature has been standardized ever since UTF-8 was 
introduced into the standard, in 1994 or thereabouts, and it is a UCS 
signature just like the others.

>   | Note that the BOM can only be ignored if it agrees with the
>   | encoding. E.g., if a style sheet encoded as "UTF-8" starts with
>   | 0xEF 0xBB 0xBF those three bytes are ignored, since they correctly
>   | encode the character U+FEFF in UTF-8. But if a style sheet encoded
>   | as "ISO-8859-1" starts with the two bytes 0xFE 0xFF (the BOM for
>   | big-endian UTF-16), the two bytes are simply interpreted as the
>   | two characters "þ" and "ÿ".

That's a bit confusing.  Normally the BOM serves to identify the 
encoding, and finding 0xFE 0xFF will tell you that the style sheet is in 
UTF-16BE, not in ISO-8859-1.  If you want to say that the style sheet was 
identified as ISO-8859-1 before seeing the BOM (e.g. by the HTTP 
charset), then just say so, to be clear.
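
The two readings of those bytes can be shown in a few lines of Python. The sample bytes and variable names are made up for this illustration.

```python
# FE FF followed by "body" in big-endian UTF-16.
data = b"\xfe\xff\x00b\x00o\x00d\x00y"

# Encoding already fixed externally as ISO-8859-1: the first two bytes are
# simply the characters U+00FE and U+00FF.
as_latin1 = data.decode("iso-8859-1")

# No prior information: FE FF is read as a signature selecting UTF-16BE,
# and the signature itself is dropped before decoding.
as_utf16 = data[2:].decode("utf-16-be")
```

Here `as_latin1` begins with "þÿ", while `as_utf16` is the string "body". Which one is right depends entirely on whether the encoding was already determined before the first bytes were examined.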

> It's a mess :-( Is there no way to forbid both the @charset and the
> BOM in CSS?

Yes: mandate that all style sheets must be in UTF-8 and be done with it :-)

-- 
François
Received on Thursday, 11 December 2003 10:42:21 UTC