- From: Asmus Freytag <asmusf@ix.netcom.com>
- Date: Tue, 16 Dec 2003 00:38:13 -0800
- To: François Yergeau <francois@yergeau.com>, Bert Bos <bert@w3.org>
- Cc: www-international@w3.org, www-style@w3.org
At 07:40 AM 12/11/2003, François Yergeau wrote:
>Bert Bos a écrit :
>>I've written some new text for section 4.4 of CSS 2.1 [1].
>>[1] http://www.w3.org/TR/CSS21/syndata.html#q23
>>...
>>   1. An HTTP "charset" parameter in a "Content-Type" field.
>>   2. The @charset at-rule.
>>   3. Mechanisms of the language of the referencing document
>>      (e.g., in HTML, the "charset" attribute of the LINK element).
>> | 4. UA-dependent mechanisms (e.g., guessing based on the BOM)
>
>That's not good, the BOM belongs in 2, along with @charset. Both are of
>the same nature: in-band identification of the character encoding. Both
>are equally valid ways to do this (but the BOM is limited to Unicode
>encodings). Using the BOM to identify encoding is not a guess any more
>than using @charset is. It should not be UA-dependent any more than
>@charset.

I would tend to agree.

>Oh, and in 1 it should be a little wider than just HTTP: there's also
>HTTPS, multipart mail with MIME headers, other similar things possibly now
>and almost certainly in the future. I recently suggested using "external
>character encoding information (such as MIME or HTTP headers)", slightly
>adapted from the XML spec.
>
>> At most one @charset rule may appear in an external style sheet
>> | and it must appear at the very start of the document, not preceded
>> | by any characters, except possibly a "BOM" (see below). Any other
>> | @charset rules must be ignored by the UA.
>
>That's good. I guess you did not like my suggestion of integrating the
>BOM in the grammar instead of discussing it in the prose?
>
>> This specification does not mandate which character encodings a
>> user agent must support.
>
>It should (UTF-8, UTF-16). Perhaps CSS3 will?

How will you be writing portable style sheets if you can't rely on either
one of these to be present?

>> | If an external style sheet has U+FEFF ("zero width non-breaking
>> | space") as the first character (i.e., even before any @charset
>> | rule), this character is interpreted as a so-called "Byte Order
>> | Mark" (BOM), as follows:
>> |
>> | - If the style sheet is encoded as "UTF-16" [RFC2781] or
>> |   "UTF-32" [UNICODE], the BOM determines the byte order
>> |   ("big-endian" or "little-endian") as explained in the cited
>> |   RFC. If the style sheet is encoded as anything else, the
>> |   U+FEFF character is ignored.
>
>This is the wrong way around, IMHO. If a UTF-16(BE|LE) BOM is found, then
>the encoding is determined to be UTF-16(BE|LE). Same for UTF-32 and
>UTF-8. U+FEFF is the UCS signature and has been since the first edition
>of ISO 10646 in 1993. Its function is to indicate that the text is in
>Unicode and to tell in which particular encoding scheme of Unicode,
>including byte order in the case of the multibyte encodings. The above
>makes too much of the BOM moniker, which is only a moniker; it's a
>signature, even in UTF-8 where the byte order aspect is a non sequitur.

Since the BOM comes before any @charset is seen, it would seem that a
conflicting @charset should be ignored, but a conflicting external encoding
declaration should invalidate the function of the BOM as an encoding
signature. Only if the external declaration is UTF-16 or UTF-32 does the
BOM have the additional semantics of selecting the byte order. If the
external declaration is UTF-16BE, UTF-16LE, etc., then, by Unicode rules,
no BOM may be present, at which point the first character in the style
sheet is a ZWNBSP (or an error, if you wish).
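To make the "signature" reading concrete, here is a minimal sketch of
sniffing the Unicode signature from the first bytes of a style sheet. This
is my own illustration, not text from the draft; the function name and
return convention are invented for the example.

```python
def sniff_unicode_signature(data: bytes):
    """Return (encoding, signature_length) if the initial bytes are a
    Unicode signature (an encoded U+FEFF), otherwise (None, 0)."""
    # Check the four-byte UTF-32 signatures before the two-byte UTF-16
    # ones, because 0xFF 0xFE is also a prefix of the UTF-32LE signature.
    if data.startswith(b"\x00\x00\xfe\xff"):
        return "UTF-32BE", 4
    if data.startswith(b"\xff\xfe\x00\x00"):
        return "UTF-32LE", 4
    if data.startswith(b"\xef\xbb\xbf"):
        return "UTF-8", 3
    if data.startswith(b"\xfe\xff"):
        return "UTF-16BE", 2
    if data.startswith(b"\xff\xfe"):
        return "UTF-16LE", 2
    return None, 0
```

Read this way, 0xFE 0xFF at the start of a style sheet identifies UTF-16BE
outright, rather than being interpreted under some previously assumed
encoding.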
>> | - An external style sheet should start with a BOM if it is
>> |   encoded as "UTF-16" or "UTF-32" and should not have a BOM in
>> |   any other encodings.
>
>Add UTF-8. The UTF-8 signature has been standardized since UTF-8 was
>introduced in the standard in 1994 or thereabouts and is a UCS signature
>just like the others.

Agreed.

>> | Note that the BOM can only be ignored if it agrees with the
>> | encoding. E.g., if a style sheet encoded as "UTF-8" starts with
>> | 0xEF 0xBB 0xBF those three bytes are ignored, since they correctly
>> | encode the character U+FEFF in UTF-8. But if a style sheet encoded
>> | as "ISO-8859-1" starts with the two bytes 0xFE 0xFF (the BOM for
>> | big-endian UTF-16), the two bytes are simply interpreted as the
>> | two characters "þ" and "ÿ".
>
>That's a bit confusing. Normally the BOM serves to identify the encoding,
>and finding 0xFE 0xFF will tell you that the style sheet is in UTF-16BE,
>not in ISO-8859-1. If you want to say that the style sheet was identified
>to be ISO-8859-1 before seeing the BOM (e.g. by the HTTP charset), then
>just say so, to be clear.

That's the only way in which the statement above makes sense, and I read it
that way, but François is right, it should say so.

>>It's a mess :-( Is there no way to forbid both the @charset and the
>>BOM in CSS?
>
>Yes: mandate that all style sheets must be in UTF-8 and be done with it :-)

No, you still get UTF-8 that's labelled with the BOM to distinguish it from
8859-1.

I think the suggestion to put the BOM in the hierarchy between HTTP and
@charset, and to treat any @charset following a BOM the same as a duplicate
@charset, should clear up the picture.

A./

PS: this caught my attention today since I've been editing the Unicode FAQ
on the BOM all day (see http://www.unicode.org/faq/utf_bom-d4.html for
today's draft (temporary location)).
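PPS: to spell out the precedence suggested above, here is a rough sketch.
Again, this is my own illustration rather than draft text; the parameter
names and the UA default are invented for the example.

```python
def pick_stylesheet_encoding(http_charset=None, bom_encoding=None,
                             at_charset=None, link_charset=None,
                             ua_default="UTF-8"):
    """Choose the style sheet encoding: external information first, then
    the BOM, then @charset, then the referencing document, then a UA
    default."""
    if http_charset:
        # An external declaration (HTTP/MIME "charset") wins outright;
        # if it names UTF-16 or UTF-32, a leading U+FEFF only selects
        # the byte order.
        return http_charset
    if bom_encoding:
        # The BOM is read before any @charset can be parsed, so a
        # conflicting @charset is ignored, just like a duplicate
        # @charset rule would be.
        return bom_encoding
    if at_charset:
        return at_charset
    if link_charset:
        return link_charset
    return ua_default
```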
Received on Tuesday, 16 December 2003 03:38:03 UTC