Re: [CSS21] BOM & @charset (issues 44 & 115) from Boris Zbarsky on 2004-02-17 (www-style@w3.org from February 2004)

From: Boris Zbarsky <bzbarsky@MIT.EDU>
Date: Tue, 17 Feb 2004 17:54:49 -0600
To: Bert Bos <bert@w3.org>
Cc: www-style@w3.org
Message-ID: <4032A9C9.8050309@mit.edu>

Bert Bos wrote:
>    1. HTTP header
>    2. BOM
>    3. @charset
>    4. etc.
> 
> But this is complicated material, so: does anybody see a problem with
> this?

Sure.  Different encodings can have the same BOM (e.g. UTF-16 and UCS-2, 
but there may also be other cases that are not quite so trivial), so the 
BOM by itself doesn't always pin down the encoding.

In case anyone is interested in what Mozilla does right now, the basic 
algorithm for this step is:

1)  Check for a BOM.  If there is one, set the "generic encoding family"
     (# of bytes per ASCII character and byte order) based on that.
2)  If there is no BOM, see whether the first couple of bytes look like
     a '@' in some random encoding.  If they do, set the "generic
     encoding family" based on what it looks like.
3)  If there was a '@', try to parse the whole @charset rule, using the
     "generic encoding family" info to find the actual ASCII chars (e.g.
     for UTF-16/UCS-2/whatever, look at every other byte).
4)  If there wasn't an @charset rule but we got a "generic encoding
     family" based on the BOM, go with the most common or inclusive
     charset we know of that fits those criteria (e.g. we would go with
     UTF-16 over UCS-2).

Steps 1, 2, and 3 are basically what the XML 1.0 spec describes as the way 
to handle XML decls and BOMs (this is the spec that's referenced in the 
encoding selection section of CSS2, iirc).
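
For illustration, here is a rough Python sketch of the four steps above.  
It is not Mozilla's actual code: the names (sniff_charset, ascii_view), 
the lookup tables, and the simplified @charset pattern are all made up 
for this sketch, and it assumes the rule is written exactly as 
@charset "name"; at the very start of the sheet.

```python
import codecs
import re

# (BOM bytes, bytes per ASCII character, byte order, fallback charset).
# Longer BOMs come first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_BE, 4, "big", "UTF-32"),
    (codecs.BOM_UTF32_LE, 4, "little", "UTF-32"),
    (codecs.BOM_UTF8, 1, None, "UTF-8"),
    (codecs.BOM_UTF16_BE, 2, "big", "UTF-16"),
    (codecs.BOM_UTF16_LE, 2, "little", "UTF-16"),
]

# What a leading '@' (0x40) looks like in each family when there is no BOM.
AT_PATTERNS = [
    (b"\x00\x00\x00@", 4, "big"),
    (b"@\x00\x00\x00", 4, "little"),
    (b"\x00@", 2, "big"),
    (b"@\x00", 2, "little"),
    (b"@", 1, None),
]

# Simplified pattern (an assumption of this sketch, not the CSS grammar).
CHARSET_RULE = re.compile(r'@charset "([^"]*)";')


def ascii_view(data, width, order, limit=256):
    """Decode fixed-width code units; anything outside ASCII becomes '?'."""
    chars = []
    for i in range(0, min(len(data), limit * width) - width + 1, width):
        value = int.from_bytes(data[i:i + width], order or "big")
        chars.append(chr(value) if value < 0x80 else "?")
    return "".join(chars)


def sniff_charset(data):
    """Return the charset name suggested by the BOM/@charset, or None."""
    family = None    # (bytes per ASCII character, byte order)
    fallback = None  # charset implied by the BOM alone
    start = 0

    # Step 1: a BOM fixes the generic encoding family.
    for bom, width, order, name in BOMS:
        if data.startswith(bom):
            family, fallback, start = (width, order), name, len(bom)
            break

    # Step 2: no BOM, so see whether the first bytes look like '@'.
    if family is None:
        for pattern, width, order in AT_PATTERNS:
            if data.startswith(pattern):
                family = (width, order)
                break

    # Step 3: read the @charset rule using the family's unit width.
    if family is not None:
        match = CHARSET_RULE.match(ascii_view(data[start:], *family))
        if match:
            return match.group(1)

    # Step 4: no @charset rule; use the BOM's fallback (None if no BOM).
    return fallback
```

With this sketch, a sheet that has a UTF-16 BOM and no @charset rule 
comes back as "UTF-16", and a sheet with neither a BOM nor a leading 
'@' comes back as None, leaving the decision to whatever comes next in 
the priority list.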

-Boris

Received on Tuesday, 17 February 2004 18:54:52 UTC