Re: [CSS21] BOM & @charset (issues 44 & 115)

On Tue, 17 Feb 2004, Boris Zbarsky wrote:
>
> Sure.  Different encodings can have the same BOM (eg UTF-16 and UCS-2,
> but there may also be other cases that are not quite so trivial).

This case is not a problem -- since UTF-16 is a superset of UCS-2, simply
treat it as UTF-16.

But in any case, the change Bert mentioned doesn't remove the previous
text, which says:

| If an external style sheet has U+FEFF ("zero width non-breaking space")
| as the first character (i.e., even before any @charset rule), this
| character is interpreted as a so-called "Byte Order Mark" (BOM), as
| follows:
|
|     * If the style sheet is encoded as "UTF-16" [RFC2781] or "UTF-32"
|       [UNICODE], the BOM determines the byte order (e.g. "big-endian" or
|       "little-endian") as explained in the cited RFC.
|     * If the style sheet is encoded as anything else, the U+FEFF
|       character is ignored.

This doesn't conflict with the steps Bert mentioned, but it does clarify
that @charset is still relevant even if there is a BOM.

(The text "BOM" in the steps links straight to this text.)
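
For illustration only, here is a minimal Python sketch of how a UA might
recognise a BOM at the start of a style sheet's bytes. It is not taken
from the spec or from any implementation; the names BOMS and sniff_bom
are hypothetical, and it assumes the BOM byte sequences exposed by
Python's codecs module.

```python
import codecs

# BOM byte sequences, longest first, so that the UTF-32LE BOM
# (FF FE 00 00) is not mistaken for the UTF-16LE BOM (FF FE).
# Assumption: a style sheet never starts with U+FEFF followed by U+0000
# in UTF-16LE, which would look identical to the UTF-32LE BOM.
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if there is no BOM."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

The returned length lets the caller skip the BOM before looking for an
@charset rule, which matches the quoted text's point that U+FEFF comes
"even before any @charset rule".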

The text before the steps says that "user agents must observe the
following priorities when determining a style sheet's character encoding
(from highest priority to lowest)". So as long as a later step doesn't
contradict an earlier one, it is still applicable.


> In case anyone is interested in what Mozilla does right now, the basic
> algorithm for this step is: [...]

A. What happens if you have a UTF-16 BOM, and an @charset encoded as
UTF-16 which claims it is ISO-8859-1?

B. Or no BOM, US-ASCII encoded @charset which claims to be UCS-4?

C. Or UTF-8 BOM followed by US-ASCII @charset claiming ISO-8859-1?

D. A style sheet with a UTF-16 BOM and no @charset, linked from a
stylesheet or document that is known to be in UCS-2?

E. A document whose odd bytes spell a US-ASCII @charset claiming UTF-16BE
and whose even bytes spell a US-ASCII @charset claiming UTF-16LE, linked
from a document or stylesheet claiming UTF-8?

etc.

These are the cases that these steps are partially clarifying. Per the
text currently in the spec, which we hope to have go to CR: in case A the
UA would use UTF-16 (the BOM trumps the @charset); case B is undefined;
case C would use UTF-8; case D would use UTF-16 (it can't be UCS-2, since
if it were, the BOM would have to be ignored per the text quoted above,
and the BOM comes before linking metadata in the list); and case E would
use UTF-8 (since the start doesn't contain an @charset rule in any
character encoding).

Case B will be covered by CSS3 Syntax.
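
The outcomes above can be sketched as a small Python function. This is
only an illustration under stated assumptions, not the spec's algorithm:
it covers just the three sources discussed in this message (BOM, then
@charset, then the referring document's encoding) and leaves out any
other steps. The names sheet_encoding, find_charset_rule, PROBES and
WIDE are hypothetical, and treating every contradictory @charset as
"undefined" (None) is an assumption modelled on case B.

```python
import codecs

# Same-prefix BOMs are listed longest first.
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32"),
    (codecs.BOM_UTF32_LE, "UTF-32"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16"),
    (codecs.BOM_UTF16_LE, "UTF-16"),
]

# Encodings in which to look for an @charset rule when there is no BOM.
# Assumption: a short illustrative list; a real UA might try others.
PROBES = ["ascii", "utf-16-be", "utf-16-le"]

# Declared names that are not ASCII-compatible (illustrative list).
WIDE = {"UTF-16", "UTF-16BE", "UTF-16LE",
        "UTF-32", "UTF-32BE", "UTF-32LE", "UCS-2", "UCS-4"}


def find_charset_rule(data: bytes):
    """Return (probe, declared_name) for a rule at the very start, else None."""
    for probe in PROBES:
        prefix = '@charset "'.encode(probe)
        if not data.startswith(prefix):
            continue
        end = data.find('";'.encode(probe), len(prefix))
        if end != -1:
            name = data[len(prefix):end].decode(probe, "replace")
            return probe, name.upper()
    return None


def sheet_encoding(data: bytes, referrer_encoding: str):
    """Return the encoding to use, or None where the result is undefined."""
    # A BOM trumps any @charset rule and the referrer (cases A, C, D).
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    rule = find_charset_rule(data)
    if rule is None:
        # No @charset at the start in any probed encoding (case E).
        return referrer_encoding
    probe, declared = rule
    # The rule must be written in an encoding compatible with the one it
    # declares; otherwise treat the result as undefined (case B).
    consistent = (probe == "ascii") != (declared in WIDE)
    return declared if consistent else None
```

Case E can be built by interleaving the two rules byte by byte: the
result starts with "@@cc", so none of the probes finds an @charset rule
and the function falls back to the referring document's UTF-8.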

-- 
Ian Hickson                                      )\._.,--....,'``.    fL
U+1047E                                         /,   _.. \   _\  ;`._ ,.
http://index.hixie.ch/                         `._.-(,_..'--(,_..'`-.;.'

Received on Tuesday, 17 February 2004 19:24:52 UTC