Re: UTF-8 signature / BOM in CSS

François Yergeau writes:
> L. David Baron wrote:
> >> > EncodingDecl = [BOM][@charset=<foobar>]
> >> >
> >> >with the additional constraint that EncodingDecl must occur at the 
> >> >start of the stylesheet.
> > 
> > I think the main advantage of such a change would be clarity.  (Or is
> > there some other advantage you were thinking of?)
> No, just that: make it explicit.
> >  I agree that it makes
> > it clearer that the BOM is allowed, but it might make it less clear that
> > the processing of the encoding declaration is an entirely separate
> > process from the tokenization and parsing of the stylesheet.
> Hmmm, good point, but isn't it already the case with @charset?

I've written some new text for section 4.4 of CSS 2.1[1]. Here is my
attempt at explaining the BOM. The paragraph after the first list now
mentions that the BOM may occur, even before @charset. And there is a
new section and a new note that detail what UAs and authors have to do
with the BOM. Changed text marked with "|" below.


    4.4 CSS document representation

    A CSS style sheet is a sequence of characters from the Universal
    Character Set (see [ISO10646]). For transmission and storage,
    these characters must be encoded by a character encoding that
    supports the set of characters available in US-ASCII (e.g., ISO
    8859-x, SHIFT JIS, etc.). For a good introduction to character
    sets and character encodings, please consult the HTML 4.0
    specification ([HTML40], chapter 5). See also the XML 1.0
    specification ([XML10], sections 2.2 and 4.3.3, and Appendix F).

    When a style sheet is embedded in another document, such as in the
    STYLE element or "style" attribute of HTML, the style sheet shares
    the character encoding of the whole document.

    When a style sheet resides in a separate file, user agents must
    observe the following priorities when determining a document's
    character encoding (from highest priority to lowest):

       1. An HTTP "charset" parameter in a "Content-Type" field.

       2. The @charset at-rule.

       3. Mechanisms of the language of the referencing document
          (e.g., in HTML, the "charset" attribute of the LINK element).

  |    4. UA-dependent mechanisms (e.g., guessing based on the BOM)
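
As an illustration only, the priority order in the list above could be
sketched in Python like this. The function and its four parameters are
hypothetical names, not anything from the spec; each parameter stands
for the label found by one of the four mechanisms, or None if that
mechanism gave nothing.

```python
from typing import Optional


def pick_encoding(http_charset: Optional[str],
                  at_charset: Optional[str],
                  link_charset: Optional[str],
                  ua_guess: Optional[str]) -> Optional[str]:
    """Return the first available encoding label, highest priority first."""
    # Order mirrors the numbered list: HTTP header, @charset rule,
    # referencing document, then UA-dependent guess (e.g., from the BOM).
    for candidate in (http_charset, at_charset, link_charset, ua_guess):
        if candidate:
            return candidate
    return None
```

With no HTTP "charset" parameter, an @charset rule wins over the LINK
attribute and over any guess.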

    At most one @charset rule may appear in an external style sheet
  | and it must appear at the very start of the document, not preceded
  | by any characters, except possibly a "BOM" (see below). Any other
  | @charset rules must be ignored by the UA.
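
A rough sketch of that placement rule, working on already-decoded
text: the rule only counts if it sits at the very start, after at most
one U+FEFF. The helper name and the exact whitespace tolerance in the
pattern are assumptions made for this example; the text above does not
pin down the token-level syntax.

```python
import re
from typing import Optional

# Anchored at the start of the text: an optional single U+FEFF, then the rule.
# Allowing arbitrary whitespace after "@charset" is an assumption of this sketch.
_LEADING_CHARSET = re.compile(r'\ufeff?@charset\s+"([^"]*)"\s*;')


def leading_charset(text: str) -> Optional[str]:
    """Return the name from an @charset rule at the very start, else None."""
    match = _LEADING_CHARSET.match(text)
    return match.group(1) if match else None
```

An @charset rule that follows anything else, even a single space, is
not found by this helper, which matches "must be ignored by the UA".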

    After "@charset", authors specify the name of a character
    encoding. The name must be a charset name as described in the IANA
    registry (See [IANA]. Also, see [CHARSETS] for a complete list of
    charsets). For example:

        @charset "ISO-8859-1";

    This specification does not mandate which character encodings a
    user agent must support.
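
Since support is left to the UA, an implementation has to check each
label against whatever it actually supports. A minimal sketch, using
Python's own codec table as a stand-in for "encodings this UA
supports" (that table is not the IANA registry, and the helper name is
made up for the example):

```python
import codecs
from typing import Optional


def supported_codec_name(label: str) -> Optional[str]:
    """Return Python's canonical codec name for a charset label, or None."""
    try:
        # codecs.lookup() is case-insensitive and resolves known aliases.
        return codecs.lookup(label).name
    except LookupError:
        # Unknown label: this stand-in "UA" does not support the encoding.
        return None
```

For the example above, supported_codec_name("ISO-8859-1") returns
"iso8859-1", and an unknown label returns None.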

  | If an external style sheet has U+FEFF ("zero width no-break
  | space") as the first character (i.e., even before any @charset
  | rule), this character is interpreted as a so-called "Byte Order
  | Mark" (BOM), as follows:
  |   - If the style sheet is encoded as "UTF-16" [RFC2781] or
  |     "UTF-32" [UNICODE], the BOM determines the byte order
  |     ("big-endian" or "little-endian") as explained in the cited
  |     references. If the style sheet is encoded as anything else, the
  |     U+FEFF character is ignored.
  |   - An external style sheet should start with a BOM if it is
  |     encoded as "UTF-16" or "UTF-32" and should not have a BOM in
  |     any other encodings.
  | Note that the BOM can only be ignored if it agrees with the
  | encoding. E.g., if a style sheet encoded as "UTF-8" starts with
  | 0xEF 0xBB 0xBF those three bytes are ignored, since they correctly
  | encode the character U+FEFF in UTF-8. But if a style sheet encoded
  | as "ISO-8859-1" starts with the two bytes 0xFE 0xFF (the BOM for
  | big-endian UTF-16), the two bytes are simply interpreted as the
  | two characters "þ" and "ÿ".
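
The BOM handling described above can be sketched as follows, assuming
the encoding has already been determined. The function name is
hypothetical. The sketch leans on the fact that Python's generic
"utf-16" and "utf-32" codecs consume the BOM themselves and use it to
pick the byte order; for every other encoding it drops a leading
U+FEFF only if the bytes really decoded to that character.

```python
import codecs


def decode_sheet(data: bytes, encoding: str) -> str:
    """Decode a style sheet and discard a leading BOM where one applies."""
    name = codecs.lookup(encoding).name
    text = data.decode(name)
    if name in ("utf-16", "utf-32"):
        # The codec has already read the BOM to choose the byte order.
        return text
    # Any other encoding: ignore U+FEFF only if it is truly the first character.
    return text[1:] if text.startswith("\ufeff") else text
```

So the UTF-8 bytes 0xEF 0xBB 0xBF disappear, while 0xFE 0xFF in an
ISO-8859-1 sheet survives as two ordinary characters.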

    Note that reliance on the @charset construct theoretically poses a
    problem since there is no a priori information on how it is
    encoded. In practice, however, the encodings in wide use on the
    Internet are either based on ASCII, UTF-16, UCS-4, or (rarely) on
    EBCDIC. This means that in general, the initial byte values of a
    document enable a user agent to detect the encoding family
    reliably, which provides enough information to decode the @charset
    rule, which in turn determines the exact character encoding.
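
To illustrate the "encoding family" idea, here is a small Python
sketch that looks at the first bytes for either a BOM or the string
"@charset" as each family would encode it, then decodes just enough to
read the rule. Everything here is an assumption for illustration: the
function names, the use of cp037 as the representative EBCDIC code
page, the 1024-byte limit, and the whitespace tolerance in the pattern.

```python
import codecs
import re
from typing import Optional

_RULE = "@charset"

# (byte prefix, codec for that family). The UTF-32 BOMs come before the
# UTF-16 ones because FF FE 00 00 also starts with the UTF-16LE BOM.
_SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
] + [(_RULE.encode(c), c)
     for c in ("utf-32-be", "utf-32-le", "utf-16-be", "utf-16-le",
               "ascii", "cp037")]


def sniff_family(data: bytes) -> Optional[str]:
    """Guess a codec good enough to read the @charset rule, else None."""
    for prefix, codec in _SIGNATURES:
        if data.startswith(prefix):
            return codec
    return None


def sniff_charset(data: bytes) -> Optional[str]:
    """Return the name given in a leading @charset rule, if one is readable."""
    family = sniff_family(data)
    if family is None:
        return None
    # Only the head of the file is needed; bad bytes later on are replaced.
    head = data[:1024].decode(family, errors="replace")
    if head.startswith("\ufeff"):
        head = head[1:]
    match = re.match(r'@charset\s+"([^"]*)"\s*;', head)
    return match.group(1) if match else None
```

The name returned by sniff_charset() is what then selects the exact
encoding; the family codec is only used to read that one rule.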

It's a mess :-( Is there no way to forbid both the @charset and the

  Bert Bos                                                   ( W 3 C )
  W3C/ERCIM                               2004 Rt des Lucioles / BP 93
  +33 (0)4 92 38 76 92            06902 Sophia Antipolis Cedex, France

Received on Wednesday, 10 December 2003 09:30:26 UTC