Re: [CSS21] BOM & @charset (issues 44 & 115) from Boris Zbarsky on 2004-02-18 (www-style@w3.org from February 2004)

From: Boris Zbarsky <bzbarsky@MIT.EDU>
Date: Tue, 17 Feb 2004 20:15:34 -0600
To: Ian Hickson <ian@hixie.ch>
Cc: Bert Bos <bert@w3.org>, www-style@w3.org
Message-ID: <4032CAC6.7040101@mit.edu>

Ian Hickson wrote:
>>But a two-byte LE BOM followed by @charset "Mysomething"; would be
>>treated as "Mysomething"?  Or what?
> 
> Is Mysomething a two-byte encoding which has the same BOM codepoint as a
> UTF-16LE BOM? If yes, then yes, since Mysomething doesn't contradict the
> BOM, but merely clarifies it. Otherwise, no. This is based on the fact that
> the BOM is higher on the list, so overrides the @charset.

So in other words, in the proposed setup one would need to do the 
following to be robust with respect to new encodings that may appear:

1)  See whether there is a BOM.
2)  Parse the @charset rule.
3)  If there is an @charset rule and a BOM, encode U+FEFF using the
     charset from the @charset rule and see whether this agrees with the
     BOM's representation.  If it does, use the @charset charset.  If
     not, guess a charset based on the BOM's encoding.

Sound right?
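
For concreteness, here is a rough Python sketch of the three steps above. It 
is only an illustration under some assumptions: the function names 
(sniff_bom, parse_charset_rule, detect_encoding) are made up, Python's codec 
registry stands in for "the encodings the UA knows about", and the 1024-byte 
window for finding the @charset rule is arbitrary.

```python
import codecs
import re

# Known BOM byte patterns, longest first so that a UTF-32LE BOM
# (FF FE 00 00) is not mistaken for a UTF-16LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Step 1: return (bom_bytes, guessed_encoding), or (None, None)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return bom, name
    return None, None


def parse_charset_rule(data, bom, bom_encoding):
    """Step 2: read the @charset name, decoding with the BOM's guess
    (or ASCII when there is no BOM). Returns None if no rule is found."""
    body = data[len(bom):] if bom else data
    text = body[:1024].decode(bom_encoding or "ascii", errors="replace")
    match = re.match(r'@charset "([^"]*)";', text)
    return match.group(1) if match else None


def detect_encoding(data):
    """Step 3: prefer @charset only if it encodes U+FEFF to the same bytes
    as the BOM actually present; otherwise fall back to the BOM's guess."""
    bom, bom_encoding = sniff_bom(data)
    declared = parse_charset_rule(data, bom, bom_encoding)
    if bom is None:
        return declared
    if declared is None:
        return bom_encoding
    try:
        # Python's generic "utf-16"/"utf-32" codecs prepend their own BOM,
        # so only endian-specific names can match this exact comparison.
        encoded = "\ufeff".encode(declared)
    except (LookupError, UnicodeEncodeError):
        # Unknown charset, or one that cannot represent U+FEFF at all.
        return bom_encoding
    return declared if encoded == bom else bom_encoding
```

With this sketch, a UTF-16LE BOM followed by @charset "Mysomething"; falls 
back to the BOM's guess unless "Mysomething" is a known encoding that encodes 
U+FEFF as FF FE.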

> It means, e.g., that if you know the encoding is UTF-8 based on the
> Content-Type header, you ignore a leading BOM.

Ah, ok.
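
As I read it, that case would look something like the following Python 
sketch. The function name decode_known_utf8 is made up, and it assumes the 
caller has already taken "UTF-8" from the Content-Type header, so the BOM is 
not consulted for detection and is simply dropped.

```python
import codecs


def decode_known_utf8(data):
    """Decode bytes already known to be UTF-8 from out-of-band information
    (e.g. Content-Type), ignoring a leading BOM if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        # The BOM carries no extra information here, so skip it.
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")
```

Python's built-in "utf-8-sig" codec does the same thing when decoding.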

> As far as I know the only encodings that can represent U+FEFF are:

I should clarify that I am by no means an intl expert.  Hence all the 
questions.  It's just that I tend to assume that any situation like the 
one we have with encodings will deteriorate (that is, more random 
encodings will appear, possibly overlapping existing ones in most 
undesirable ways).

I would dearly like to be wrong in this, of course.  ;)

> True, but the encoding of the @charset seems more likely to be right than
> the given encoding.

Fair enough.

-Boris

Received on Tuesday, 17 February 2004 21:15:37 UTC