Re: UTF-8 signature / BOM in CSS

At 07:40 AM 12/11/2003, François Yergeau wrote:

>Bert Bos wrote:
>>I've written some new text for section 4.4 of CSS 2.1[1]. [1] 
>>http://www.w3.org/TR/CSS21/syndata.html#q23
>>...
>>        1. An HTTP "charset" parameter in a "Content-Type" field.
>>        2. The @charset at-rule.
>>        3. Mechanisms of the language of the referencing document
>>           (e.g., in HTML, the "charset" attribute of the LINK
>>           element).
>>   |    4. UA-dependent mechanisms (e.g., guessing based on the BOM)
>
>That's not good, the BOM belongs in 2, along with @charset.  Both are of 
>the same nature: in-band identification of the character encoding.  Both 
>are equally valid ways to do this (but the BOM is limited to Unicode 
>encodings).  Using the BOM to identify encoding is not a guess any more 
>than using @charset is.  It should not be UA-dependent any more than @charset.

I would tend to agree.


>Oh, and in 1 it should be a little wider than just HTTP: there's also 
>HTTPS, multipart mail with MIME headers, other similar things possibly now 
>and almost certainly in the future.  I recently suggested using "external 
>character encoding information (such as MIME or HTTP headers)", slightly 
>adapted from the XML spec.
>
>>     At most one @charset rule may appear in an external style sheet
>>   | and it must appear at the very start of the document, not preceded
>>   | by any characters, except possibly a "BOM" (see below). Any other
>>   | @charset rules must be ignored by the UA.
>
>That's good.  I guess you did not like my suggestion of integrating the 
>BOM in the grammar instead of discussing it in the prose?
>
>>     This specification does not mandate which character encodings a
>>     user agent must support.
>
>It should (UTF-8, UTF-16).  Perhaps CSS3 will?

How will you be writing portable style sheets, if you can't rely on either 
one of these to be present?


>>   | If an external style sheet has U+FEFF ("zero width non-breaking
>>   | space") as the first character (i.e., even before any @charset
>>   | rule), this character is interpreted as a so-called "Byte Order
>>   | Mark" (BOM), as follows:
>>   |
>>   |   - If the style sheet is encoded as "UTF-16" [RFC2781] or
>>   |     "UTF-32" [UNICODE], the BOM determines the byte order
>>   |     ("big-endian" or "little-endian") as explained in the cited
>>   |     RFC. If the style sheet is encoded as anything else, the
>>   |     U+FEFF character is ignored.
>
>This is the wrong way around, IMHO.  If a UTF-16(BE|LE) BOM is found, then 
>the encoding is determined to be UTF-16(BE|LE).  Same for UTF-32 and 
>UTF-8.  U+FEFF is the UCS signature and has been since the first edition 
>of ISO 10646 in 1993.  Its function is to indicate that the text is in 
>Unicode and to tell in which particular encoding scheme of Unicode, 
>including byte order in the case of the multibyte encodings. The above 
>makes too much of the BOM moniker, which is only a moniker; it's a 
>signature, even in UTF-8 where the byte order aspect is non sequitur.

Since the BOM comes before any @charset is seen, it would seem that a
conflicting @charset should be ignored, but a conflicting external encoding
declaration should invalidate the function of the BOM as an encoding
signature.

Only if the external declaration is UTF-16 or UTF-32 does the BOM have the
additional semantics of selecting the byte order. If the external declaration
is UTF-16BE, UTF-16LE, etc., then, by Unicode rules, no BOM may be present,
at which point the first character in the style sheet is a ZWNBSP (or an
error, if you wish).
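
To make that concrete, here is a rough sketch (Python, purely as an
illustration of what I mean, not spec text; the names decide() and
sniff_signature() are mine) of how a UA might combine an external
declaration with the signature:

    # Signatures, longest first, so the UTF-32 forms win over UTF-16.
    _SIGNATURES = [
        (b'\x00\x00\xfe\xff', 'utf-32-be'),
        (b'\xff\xfe\x00\x00', 'utf-32-le'),
        (b'\xef\xbb\xbf',     'utf-8'),
        (b'\xfe\xff',         'utf-16-be'),
        (b'\xff\xfe',         'utf-16-le'),
    ]

    def sniff_signature(data):
        """Return (encoding, signature_length) if the style sheet starts
        with a UCS signature, else (None, 0)."""
        for sig, enc in _SIGNATURES:
            if data.startswith(sig):
                return enc, len(sig)
        return None, 0

    def decide(external, data):
        """external: charset from HTTP/MIME (or None); data: raw bytes.
        Returns (encoding, number of signature bytes to strip)."""
        sniffed, sig_len = sniff_signature(data)
        if external is None:
            return sniffed, sig_len          # the signature, if any, decides
        ext = external.lower()
        if ext in ('utf-16', 'utf-32'):
            # The BOM only selects the byte order within the declared form.
            if sniffed and sniffed.startswith(ext):
                return sniffed, sig_len
            return ext, 0
        if ext in ('utf-16be', 'utf-16le', 'utf-32be', 'utf-32le'):
            return ext, 0                    # no BOM allowed; U+FEFF is content
        # Any other external label wins; strip the signature only if it agrees.
        return ext, sig_len if sniffed == ext else 0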


>>   |   - An external style sheet should start with a BOM if it is
>>   |     encoded as "UTF-16" or "UTF-32" and should not have a BOM in
>>   |     any other encodings.
>
>Add UTF-8.  The UTF-8 signature has been standardized since UTF-8 was
>introduced into the standard in 1994 or thereabouts, and it is a UCS
>signature just like the others.

Agreed.


>>   | Note that the BOM can only be ignored if it agrees with the
>>   | encoding. E.g., if a style sheet encoded as "UTF-8" starts with
>>   | 0xEF 0xBB 0xBF those three bytes are ignored, since they correctly
>>   | encode the character U+FEFF in UTF-8. But if a style sheet encoded
>>   | as "ISO-8859-1" starts with the two bytes 0xFE 0xFF (the BOM for
>>   | big-endian UTF-16), the two bytes are simply interpreted as the
>>   | two characters "þ" and "ÿ".
>
>That's a bit confusing.  Normally the BOM serves to identify the encoding 
>and finding 0xFE 0xFF will tell you that the style sheet is in UTF-16BE, 
>not in ISO-8859-1.  If you want to say that the style sheet was identified
>to be ISO-8859-1 before seeing the BOM (e.g. by the HTTP charset), then just
>say so, to be clear.

That's the only way in which the statement above makes sense, and I read it
that way, but François is right: it should say so.
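
For what it's worth, that point is easy to demonstrate; in Python (again
only as an illustration) the same leading bytes come out quite differently
depending on the encoding that has already been determined:

    # The bytes 0xFE 0xFF under two different, externally declared encodings:
    b'\xfe\xff'.decode('utf-16')          # -> ''        (consumed as the BOM)
    b'\xfe\xff'.decode('iso-8859-1')      # -> 'þÿ'      (two Latin-1 characters)

    # Likewise the UTF-8 signature 0xEF 0xBB 0xBF:
    b'\xef\xbb\xbf'.decode('utf-8-sig')   # -> ''        (signature stripped)
    b'\xef\xbb\xbf'.decode('utf-8')       # -> '\ufeff'  (kept, i.e. a ZWNBSP)
    b'\xef\xbb\xbf'.decode('iso-8859-1')  # -> 'ï»¿'     (three Latin-1 characters)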


>>It's a mess :-( Is there no way to forbid both the @charset and the
>>BOM in CSS?
>
>Yes: mandate that all style sheets must be in UTF-8 and be done with it :-)

No, you still get UTF-8 that's labelled with the BOM to distinguish it from 
8859-1.

I think the suggestion to put the BOM in the hierarchy between HTTP and
@charset, and to treat any @charset following a BOM the same as a duplicate
@charset, should clear up the picture.
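
In code, that suggestion amounts to something like the sketch below (same
caveat: a hedged illustration, not spec text; it reuses sniff_signature()
from the earlier sketch, at_charset_label() is a deliberately crude
stand-in for real @charset parsing, and the byte-order interplay discussed
above is left out for brevity):

    import re

    def at_charset_label(data):
        """Crude @charset detector; assumes an ASCII-compatible prefix
        and only matches at the very start of the style sheet."""
        m = re.match(rb'@charset "([^"]+)";', data)
        return m.group(1).decode('ascii') if m else None

    def style_sheet_encoding(http_charset, data, link_charset=None,
                             ua_default='utf-8'):
        # 1. External information (HTTP, MIME, and the like).
        if http_charset:
            return http_charset
        # 2. The BOM / UCS signature.
        sniffed, _ = sniff_signature(data)
        if sniffed:
            return sniffed    # any @charset after it is treated as a duplicate
        # 3. The @charset rule.
        at_rule = at_charset_label(data)
        if at_rule:
            return at_rule
        # 4. Mechanisms of the referencing document, then 5. the UA default.
        return link_charset or ua_default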

A./

PS: this caught my attention today since I've been editing the Unicode FAQ 
on the BOM all day (see
http://www.unicode.org/faq/utf_bom-d4.html for today's draft (temporary 
location)).
