- From: François Yergeau <francois@yergeau.com>
- Date: Thu, 11 Dec 2003 10:40:34 -0500
- To: Bert Bos <bert@w3.org>
- Cc: www-international@w3.org, www-style@w3.org
Bert Bos wrote:
> I've written some new text for section 4.4 of CSS 2.1[1].
> [1] http://www.w3.org/TR/CSS21/syndata.html#q23
> ...
>   1. An HTTP "charset" parameter in a "Content-Type" field.
>
>   2. The @charset at-rule.
>
>   3. Mechanisms of the language of the referencing document
>      (e.g., in HTML, the "charset" attribute of the LINK
>      element).
>
> | 4. UA-dependent mechanisms (e.g., guessing based on the BOM)

That's not good, the BOM belongs in 2, along with @charset. Both are
of the same nature: in-band identification of the character encoding.
Both are equally valid ways to do this (but the BOM is limited to
Unicode encodings). Using the BOM to identify the encoding is not a
guess any more than using @charset is. It should not be UA-dependent
any more than @charset.

Oh, and in 1 it should be a little wider than just HTTP: there's also
HTTPS, multipart mail with MIME headers, and possibly other similar
things now and almost certainly in the future. I recently suggested
using "external character encoding information (such as MIME or HTTP
headers)", slightly adapted from the XML spec.

> At most one @charset rule may appear in an external style sheet
> | and it must appear at the very start of the document, not preceded
> | by any characters, except possibly a "BOM" (see below). Any other
> | @charset rules must be ignored by the UA.

That's good. I guess you did not like my suggestion of integrating
the BOM in the grammar instead of discussing it in the prose?

> This specification does not mandate which character encodings a
> user agent must support.

It should (UTF-8, UTF-16). Perhaps CSS3 will?

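To illustrate the ordering argued for above, here is a minimal Python sketch; it is not text from the CSS 2.1 draft. It consults external information (such as an HTTP or MIME charset) first, then in-band identification by either a Unicode signature or a leading @charset rule, then the referencing document's mechanism. The names SIGNATURES, in_band_encoding and detect_encoding are hypothetical, and letting a signature win over a conflicting @charset is an assumption of the sketch, since neither the draft nor this message settles that case.

```python
import re

# Unicode signatures (BOMs). UTF-32 is listed first because the UTF-32LE
# signature begins with the same two bytes as the UTF-16LE one.
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16BE"),
    (b"\xff\xfe", "UTF-16LE"),
]

# An @charset rule in an ASCII-compatible encoding, e.g. @charset "ISO-8859-1";
CHARSET_RULE = re.compile(rb'@charset "([^"]*)";')


def in_band_encoding(data):
    """Return the encoding named by a signature or a leading @charset, else None."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name  # assumption: a signature wins over any @charset
    match = CHARSET_RULE.match(data)  # match() only matches at the very start
    if match:
        return match.group(1).decode("ascii", "replace")
    return None


def detect_encoding(data, external_charset=None, link_charset=None):
    """Apply the three sources in priority order; None means nothing was found."""
    # 1. External character encoding information (HTTP, MIME headers, ...).
    if external_charset:
        return external_charset
    # 2. In-band identification: signature or @charset.
    found = in_band_encoding(data)
    if found:
        return found
    # 3. Mechanism of the referencing document (e.g. LINK charset in HTML).
    return link_charset
```

With this sketch, detect_encoding(b"\xfe\xff", external_charset="ISO-8859-1") returns "ISO-8859-1" because external information is consulted first, while the same bytes with no external information are identified as "UTF-16BE".
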
> | If an external style sheet has U+FEFF ("zero width non-breaking
> | space") as the first character (i.e., even before any @charset
> | rule), this character is interpreted as a so-called "Byte Order
> | Mark" (BOM), as follows:
> |
> | - If the style sheet is encoded as "UTF-16" [RFC2781] or
> |   "UTF-32" [UNICODE], the BOM determines the byte order
> |   ("big-endian" or "little-endian") as explained in the cited
> |   RFC. If the style sheet is encoded as anything else, the
> |   U+FEFF character is ignored.

This is the wrong way around, IMHO. If a UTF-16(BE|LE) BOM is found,
then the encoding is determined to be UTF-16(BE|LE). Same for UTF-32
and UTF-8. U+FEFF is the UCS signature and has been since the first
edition of ISO 10646 in 1993. Its function is to indicate that the
text is in Unicode and to tell in which particular encoding scheme of
Unicode, including byte order in the case of the multibyte encodings.
The above makes too much of the BOM moniker, which is only a moniker;
it's a signature, even in UTF-8 where the byte order aspect is a non
sequitur.

> | - An external style sheet should start with a BOM if it is
> |   encoded as "UTF-16" or "UTF-32" and should not have a BOM in
> |   any other encodings.

Add UTF-8. The UTF-8 signature has been standardized since UTF-8 was
introduced in the standard in 1994 or thereabouts and is a UCS
signature just like the others.

> | Note that the BOM can only be ignored if it agrees with the
> | encoding. E.g., if a style sheet encoded as "UTF-8" starts with
> | 0xEF 0xBB 0xBF those three bytes are ignored, since they correctly
> | encode the character U+FEFF in UTF-8. But if a style sheet encoded
> | as "ISO-8859-1" starts with the two bytes 0xFE 0xFF (the BOM for
> | big-endian UTF-16), the two bytes are simply interpreted as the
> | two characters "þ" and "ÿ".

That's a bit confusing.
Normally the BOM serves to identify the encoding, and finding
0xFE 0xFF will tell you that the style sheet is in UTF-16BE, not in
ISO-8859-1. If you want to say that the style sheet was identified
as ISO-8859-1 before seeing the BOM (e.g. by the HTTP charset), then
just say so, to be clear.

> It's a mess :-( Is there no way to forbid both the @charset and the
> BOM in CSS?

Yes: mandate that all style sheets must be in UTF-8 and be done with
it :-)

--
François
Received on Thursday, 11 December 2003 10:42:21 UTC