- From: Asmus Freytag <asmusf@ix.netcom.com>
- Date: Tue, 16 Dec 2003 00:38:13 -0800
- To: François Yergeau <francois@yergeau.com>, Bert Bos <bert@w3.org>
- Cc: www-international@w3.org, www-style@w3.org
At 07:40 AM 12/11/2003, François Yergeau wrote:
>Bert Bos wrote:
>>I've written some new text for section 4.4 of CSS 2.1 [1].
>>[1] http://www.w3.org/TR/CSS21/syndata.html#q23
>>...
>> 1. An HTTP "charset" parameter in a "Content-Type" field.
>> 2. The @charset at-rule.
>> 3. Mechanisms of the language of the referencing document
>> (e.g., in HTML, the "charset" attribute of the LINK
>> element).
>> | 4. UA-dependent mechanisms (e.g., guessing based on the BOM)
>
>That's not good; the BOM belongs in 2, along with @charset. Both are of
>the same nature: in-band identification of the character encoding. Both
>are equally valid ways to do this (but the BOM is limited to Unicode
>encodings). Using the BOM to identify the encoding is not a guess any more
>than using @charset is, and it should not be UA-dependent any more than @charset is.
I would tend to agree.
>Oh, and in 1 it should be a little wider than just HTTP: there's also
>HTTPS, multipart mail with MIME headers, and other similar things, possibly
>now and almost certainly in the future. I recently suggested using "external
>character encoding information (such as MIME or HTTP headers)", slightly
>adapted from the XML spec.
>
>> At most one @charset rule may appear in an external style sheet
>> | and it must appear at the very start of the document, not preceded
>> | by any characters, except possibly a "BOM" (see below). Any other
>> | @charset rules must be ignored by the UA.
>
>That's good. I guess you did not like my suggestion of integrating the
>BOM in the grammar instead of discussing it in the prose?
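Whichever way it ends up being expressed, in the grammar or in the prose,
the placement rule is easy to check mechanically. A rough sketch in Python
(the helper name and the details are mine, not from the draft):

    import re

    # Rough sketch, not the draft text: return the encoding named by a
    # well-placed @charset rule, i.e. one at the very start of the sheet,
    # preceded at most by a single U+FEFF signature. An @charset rule
    # appearing anywhere else is simply ignored.
    def leading_charset(text):
        if text.startswith("\ufeff"):               # optional BOM / signature
            text = text[1:]
        m = re.match(r'@charset "([^"]+)";', text)
        return m.group(1) if m else None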
>
>> This specification does not mandate which character encodings a
>> user agent must support.
>
>It should (UTF-8, UTF-16). Perhaps CSS3 will?
How will you write portable style sheets if you can't rely on either one of
these being supported?
>> | If an external style sheet has U+FEFF ("zero width non-breaking
>> | space") as the first character (i.e., even before any @charset
>> | rule), this character is interpreted as a so-called "Byte Order
>> | Mark" (BOM), as follows:
>> |
>> | - If the style sheet is encoded as "UTF-16" [RFC2781] or
>> | "UTF-32" [UNICODE], the BOM determines the byte order
>> | ("big-endian" or "little-endian") as explained in the cited
>> | RFC. If the style sheet is encoded as anything else, the
>> | U+FEFF character is ignored.
>
>This is the wrong way around, IMHO. If a UTF-16(BE|LE) BOM is found, then
>the encoding is determined to be UTF-16(BE|LE). Same for UTF-32 and
>UTF-8. U+FEFF is the UCS signature and has been since the first edition
>of ISO 10646 in 1993. Its function is to indicate that the text is in
>Unicode and to tell in which particular encoding scheme of Unicode,
>including byte order in the case of the multibyte encodings. The above
>makes too much of the BOM moniker, which is only a moniker; it's a
>signature, even in UTF-8, where the byte order aspect is a non sequitur.
Since the BOM comes before any @charset is seen, it would seem that a
conflicting @charset should be ignored, but a conflicting external encoding
declaration should invalidate the function of the BOM as encoding signature.
Only if the external declaration is UTF-16 or UTF-32 does the BOM have the
additional semantics of selecting the byte order. If the external declaration
is UTF-16BE, UTF-16LE, etc., then, by Unicode rules, no BOM may be present,
in which case the first character in the style sheet is a ZWNBSP (or an
error, if you wish).
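To spell that reading out, here is a rough sketch in Python (the names and
the exact list of signatures are mine, not taken from any spec text):

    # Rough sketch of my reading: reconcile an external (HTTP/MIME) charset
    # label with a leading U+FEFF used as an encoding signature.
    SIGNATURES = [                          # longest signatures first
        (b"\x00\x00\xfe\xff", "UTF-32BE"),
        (b"\xff\xfe\x00\x00", "UTF-32LE"),
        (b"\xef\xbb\xbf",     "UTF-8"),
        (b"\xfe\xff",         "UTF-16BE"),
        (b"\xff\xfe",         "UTF-16LE"),
    ]

    def resolve_encoding(external, first_bytes):
        from_bom = next((enc for sig, enc in SIGNATURES
                         if first_bytes.startswith(sig)), None)
        if external is None:
            return from_bom                 # the BOM acts as the signature
        if external.upper() in ("UTF-16", "UTF-32"):
            return from_bom or external     # the BOM only selects byte order
        # Any other external label (including UTF-16BE/LE and UTF-32BE/LE)
        # wins; a leading U+FEFF is then just a ZWNBSP (or an error).
        return external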
>> | - An external style sheet should start with a BOM if it is
>> | encoded as "UTF-16" or "UTF-32" and should not have a BOM in
>> | any other encodings.
>
>Add UTF-8. The UTF-8 signature has been standardized since UTF-8 was
>introduced into the standard in 1994 or thereabouts, and it is a UCS
>signature just like the others.
Agreed.
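(For the record, the UTF-8 form of U+FEFF is just as well defined as the
16- and 32-bit forms; a quick illustration in Python, nothing more:)

    # Illustration only: the UCS signature U+FEFF has a fixed byte sequence
    # in each encoding scheme, including UTF-8.
    assert "\ufeff".encode("utf-8") == b"\xef\xbb\xbf"
    assert "\ufeff".encode("utf-16-be") == b"\xfe\xff"
    assert "\ufeff".encode("utf-16-le") == b"\xff\xfe"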
>> | Note that the BOM can only be ignored if it agrees with the
>> | encoding. E.g., if a style sheet encoded as "UTF-8" starts with
>> | 0xEF 0xBB 0xBF those three bytes are ignored, since they correctly
>> | encode the character U+FEFF in UTF-8. But if a style sheet encoded
>> | as "ISO-8859-1" starts with the two bytes 0xFE 0xFF (the BOM for
>> | big-endian UTF-16), the two bytes are simply interpreted as the
>> | two characters "þ" and "ÿ".
>
>That's a bit confusing. Normally the BOM serves to identify the encoding
>and finding 0xFE 0xFF will tell you that the style sheet is in UTF-16BE,
>not in ISO-8859-1. If you want to say that the ss was identified to be
>ISO-8859-1 before seeing the BOM (e.g. by the HTTP charset), then just say
>so, to be clear.
That's the only way in which the statement above makes sense, and I read it
that way, but François is right: it should say so.
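With the actual bytes, the intended reading looks like this (Python, purely
for illustration):

    # Sheet already identified as UTF-8: the three signature bytes decode
    # to U+FEFF, which is then dropped.
    assert b"\xef\xbb\xbf".decode("utf-8") == "\ufeff"
    # Sheet already identified as ISO-8859-1: 0xFE 0xFF are just two
    # ordinary characters there, not a signature.
    assert b"\xfe\xff".decode("iso-8859-1") == "\u00fe\u00ff"   # "þÿ"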
>>It's a mess :-( Is there no way to forbid both the @charset and the
>>BOM in CSS?
>
>Yes: mandate that all style sheets must be in UTF-8 and be done with it :-)
No, you still get UTF-8 that's labelled with the BOM to distinguish it from
ISO-8859-1.
I think the suggestion to put the BOM in the hierarchy between HTTP and
@charset, and to treat any @charset following a BOM the same as a duplicate
@charset, should clear up the picture.
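Roughly, in code (a sketch of that suggestion only, with names of my own
choosing):

    # Sketch of the suggested precedence, not of the current draft:
    # external label > BOM/signature > @charset > referencing document > UA guess
    def pick_encoding(http_charset, bom_charset, at_charset, link_charset):
        if http_charset:
            return http_charset        # external information wins
        if bom_charset:
            return bom_charset         # any @charset after a BOM is a duplicate
        if at_charset:
            return at_charset
        if link_charset:
            return link_charset        # e.g. the charset attribute on LINK
        return None                    # left to UA-dependent mechanisms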
A./
PS: this caught my attention today since I've been editing the Unicode FAQ
on the BOM all day (see http://www.unicode.org/faq/utf_bom-d4.html for
today's draft; temporary location).