- From: Bert Bos <bert@w3.org>
- Date: Wed, 10 Dec 2003 15:29:02 +0100
- To: www-international@w3.org, www-style@w3.org
François Yergeau writes:
 > L. David Baron wrote:
 > >> > EncodingDecl = [BOM][@charset=<foobar>]
 > >> >
 > >> >with the additional constraint that EncodingDecl must occur at the
 > >> >start of the stylesheet.
 > >
 > > I think the main advantage of such a change would be clarity.  (Or is
 > > there some other advantage you were thinking of?)
 >
 > No, just that: make it explicit.
 >
 > > I agree that it makes
 > > it clearer that the BOM is allowed, but it might make it less clear that
 > > the processing of the encoding declaration is an entirely separate
 > > process from the tokenization and parsing of the stylesheet.
 >
 > Hmmm, good point, but isn't it already the case with @charset?

I've written some new text for section 4.4 of CSS 2.1[1]. Here is my
attempt at explaining the BOM. The paragraph after the first list now
mentions that the BOM may occur, even before @charset. And there is a
new section and a new note that detail what UAs and authors have to do
with the BOM. Changed text marked with "|" below.

[1] http://www.w3.org/TR/CSS21/syndata.html#q23


  4.4 CSS document representation

  A CSS style sheet is a sequence of characters from the Universal
  Character Set (see [ISO10646]). For transmission and storage, these
  characters must be encoded by a character encoding that supports the
  set of characters available in US-ASCII (e.g., ISO 8859-x, SHIFT JIS,
  etc.). For a good introduction to character sets and character
  encodings, please consult the HTML 4.0 specification ([HTML40],
  chapter 5). See also the XML 1.0 specification ([XML10], sections
  2.2 and 4.3.3, and Appendix F).

  When a style sheet is embedded in another document, such as in the
  STYLE element or "style" attribute of HTML, the style sheet shares
  the character encoding of the whole document.

  When a style sheet resides in a separate file, user agents must
  observe the following priorities when determining a document's
  character encoding (from highest priority to lowest):

    1. An HTTP "charset" parameter in a "Content-Type" field.
    2. The @charset at-rule.
    3. Mechanisms of the language of the referencing document (e.g.,
       in HTML, the "charset" attribute of the LINK element).
|   4. UA-dependent mechanisms (e.g., guessing based on the BOM)

  At most one @charset rule may appear in an external style sheet
| and it must appear at the very start of the document, not preceded
| by any characters, except possibly a "BOM" (see below). Any other
| @charset rules must be ignored by the UA.

  After "@charset", authors specify the name of a character encoding.
  The name must be a charset name as described in the IANA registry
  (See [IANA]. Also, see [CHARSETS] for a complete list of charsets).
  For example:

    @charset "ISO-8859-1";

  This specification does not mandate which character encodings a
  user agent must support.

| If an external style sheet has U+FEFF ("zero width non-breaking
| space") as the first character (i.e., even before any @charset
| rule), this character is interpreted as a so-called "Byte Order
| Mark" (BOM), as follows:
|
|   - If the style sheet is encoded as "UTF-16" [RFC2781] or
|     "UTF-32" [UNICODE], the BOM determines the byte order
|     ("big-endian" or "little-endian") as explained in the cited
|     RFC. If the style sheet is encoded as anything else, the
|     U+FEFF character is ignored.
|
|   - An external style sheet should start with a BOM if it is
|     encoded as "UTF-16" or "UTF-32" and should not have a BOM in
|     any other encodings.
|
| Note that the BOM can only be ignored if it agrees with the
| encoding. E.g., if a style sheet encoded as "UTF-8" starts with
| 0xEF 0xBB 0xBF those three bytes are ignored, since they correctly
| encode the character U+FEFF in UTF-8. But if a style sheet encoded
| as "ISO-8859-1" starts with the two bytes 0xFE 0xFF (the BOM for
| big-endian UTF-16), the two bytes are simply interpreted as the
| two characters "þ" and "ÿ".
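
(Illustration only, not part of the proposed text.) The priority list
and the BOM rule above could be sketched roughly as follows in Python.
The function and parameter names (pick_encoding, decode_stylesheet,
http_charset, at_charset, link_charset, guessed) are hypothetical, and
the sketch assumes the encoding name has already been extracted from
each source; returning None when no source names an encoding is also
an assumption, since the draft does not say what happens then.

```python
from typing import Optional


def pick_encoding(http_charset: Optional[str] = None,
                  at_charset: Optional[str] = None,
                  link_charset: Optional[str] = None,
                  guessed: Optional[str] = None) -> Optional[str]:
    """Return the first encoding name found, highest priority first.

    Order follows the draft list: HTTP charset, @charset, referencing
    document, then UA-dependent guessing. Returns None if none is set
    (an assumption of this sketch).
    """
    for candidate in (http_charset, at_charset, link_charset, guessed):
        if candidate:
            return candidate
    return None


def decode_stylesheet(data: bytes, encoding: str) -> str:
    """Decode a style sheet and drop a leading U+FEFF, if present.

    Python's "utf-16" and "utf-32" codecs consume the BOM themselves and
    use it to pick the byte order. For other encodings, a BOM that
    agrees with the encoding decodes to U+FEFF, which is dropped here.
    A BOM that does not agree (e.g. FE FF read as ISO-8859-1) simply
    decodes to ordinary characters and is left alone.
    """
    text = data.decode(encoding)
    if text.startswith("\ufeff"):
        text = text[1:]
    return text
```

With this sketch, decode_stylesheet(b"\xef\xbb\xbfa{}", "utf-8")
returns "a{}", while decode_stylesheet(b"\xfe\xffa{}", "iso-8859-1")
keeps the two bytes as the characters "þ" and "ÿ" in front of "a{}".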

  Note that reliance on the @charset construct theoretically poses a
  problem since there is no a priori information on how it is encoded.
  In practice, however, the encodings in wide use on the Internet are
  either based on ASCII, UTF-16, UCS-4, or (rarely) on EBCDIC. This
  means that in general, the initial byte values of a document enable
  a user agent to detect the encoding family reliably, which provides
  enough information to decode the @charset rule, which in turn
  determines the exact character encoding.


It's a mess :-( Is there no way to forbid both the @charset and the
BOM in CSS?



Bert
-- 
  Bert Bos                                ( W 3 C ) http://www.w3.org/
  http://www.w3.org/people/bos/                              W3C/ERCIM
  bert@w3.org                             2004 Rt des Lucioles / BP 93
  +33 (0)4 92 38 76 92            06902 Sophia Antipolis Cedex, France
Received on Wednesday, 10 December 2003 09:30:26 UTC