Re: [CSS21] BOM & @charset (issues 44 & 115) from Ian Hickson on 2004-02-18 (www-style@w3.org from February 2004)

From: Ian Hickson <ian@hixie.ch>
Date: Wed, 18 Feb 2004 02:04:24 +0000 (UTC)
To: Boris Zbarsky <bzbarsky@MIT.EDU>
Cc: Bert Bos <bert@w3.org>, www-style@w3.org
Message-ID: <Pine.LNX.4.58.0402180138390.2286@dhalsim.dreamhost.com>

On Tue, 17 Feb 2004, Boris Zbarsky wrote:
>
> But a two-byte LE BOM followed by @charset "Mysomething"; would be
> treated as "Mysomething"?  Or what?

Is Mysomething a two-byte encoding which has the same BOM codepoint as a
UTF-16LE BOM? If yes, then yes, since Mysomething doesn't contradict the
BOM, but merely clarifies it. Otherwise, no. This is based on the fact that
the BOM is higher on the list, so overrides the @charset.
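
A minimal Python sketch of that precedence rule, for illustration only.
The function name, the label sets, and the lowercase encoding names are
all hypothetical; nothing here is taken from the spec text. It assumes
the BOM has already been sniffed into an encoding name (or None).

```python
# Assumption: which @charset labels count as "clarifying" a given BOM.
# These sets are illustrative, not taken from any specification.
COMPATIBLE_LABELS = {
    "utf-16le": {"utf-16", "utf-16le", "ucs-2"},
    "utf-16be": {"utf-16", "utf-16be", "ucs-2"},
    "utf-8": {"utf-8"},
}


def resolve_encoding(bom_encoding, charset_label):
    """Combine a sniffed BOM (or None) with an @charset label (or None)."""
    label = charset_label.lower() if charset_label else None
    if bom_encoding is None:
        return label              # no BOM: @charset is all we have
    if label in COMPATIBLE_LABELS.get(bom_encoding, set()):
        return label              # @charset clarifies the BOM
    return bom_encoding           # @charset contradicts or is absent: BOM wins
```

With these assumed tables, resolve_encoding("utf-16le", "Mysomething")
falls through to "utf-16le", because "Mysomething" is not known to be
compatible with that BOM.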

> I'm not sure what the impact of that "ignored for charsets other than
> UTF-16" language is...  especially in light of your examples.

It means, e.g., that if you know the encoding is UTF-8 based on the
Content-Type header, you ignore a leading BOM.
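
As a rough Python illustration of one reading of "ignore" here: the
encoding is already fixed as UTF-8, so the BOM plays no part in choosing
it and is simply dropped from the decoded text. The function name is
made up for this sketch.

```python
def decode_known_utf8(data: bytes) -> str:
    """Decode bytes already known (e.g. from Content-Type) to be UTF-8."""
    text = data.decode("utf-8")      # the plain codec keeps a BOM as U+FEFF
    if text.startswith("\ufeff"):
        text = text[1:]              # drop the leading BOM character
    return text
```

Python's "utf-8-sig" codec does the same stripping in one step; the
explicit version just makes the behaviour visible.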

> If there are no cases when two substantially different charsets can have
> the same BOM, there are indeed no such situations.  Are there indeed no
> such cases?

As far as I know the only encodings that can represent U+FEFF are:

   UTF-16, UCS-2 (LE and BE variants)
   UTF-32, UCS-4 (LE, BE, 3412, and 2143 variants)
   UTF-8
   UTF-7 (deprecated)

The only encoding difference between UTF-16 and UCS-2 is whether or not
the surrogate codepoints can be used, so for all intents and purposes
UCS-2 can be treated identically to UTF-16. UCS-4 and UTF-32 are identical
as far as encoding goes. UTF-7 is dead.

So as far as I am aware, there are no cases where even trivially different
character encodings can have the same BOM, much less substantially
different encodings.
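
For reference, the byte patterns that U+FEFF produces in the encodings
listed above can be written out as a small Python table. This is only a
sketch: the names are hypothetical, UTF-7 is left out, and a real
sniffer would need more care than a prefix match.

```python
# U+FEFF as it appears at the start of a byte stream, per encoding.
# Four-byte patterns come first because the UTF-16 BOMs are prefixes of
# two of them (FF FE 00 00 could also be a UTF-16LE BOM followed by
# U+0000, and FE FF 00 00 a UTF-16BE BOM followed by U+0000).
BOMS = [
    (b"\x00\x00\xfe\xff", "UTF-32BE / UCS-4 (1234)"),
    (b"\xff\xfe\x00\x00", "UTF-32LE / UCS-4 (4321)"),
    (b"\x00\x00\xff\xfe", "UCS-4 (2143)"),
    (b"\xfe\xff\x00\x00", "UCS-4 (3412)"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16BE / UCS-2BE"),
    (b"\xff\xfe", "UTF-16LE / UCS-2LE"),
]


def sniff_bom(data: bytes):
    """Return the encoding family implied by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```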

>>>> A. What happens if you have a UTF-16 BOM, and an @charset encoded as
>>>> UTF-16 which claims it is ISO-8859-1?
>>>
>>> We will end up treating the sheet as ISO-8859-1 at the moment.
>>
>> I presume you agree that is suboptimal?
>
> Frankly, if a sheet has an @charset rule that does not match the actual
> sheet data (or does it?) no matter what you do is suboptimal -- you have
> a good chance of getting it "wrong" either way.

True, but the encoding in which the @charset rule is actually written seems
more likely to be right than the encoding it names.
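
Case A can be reproduced in a few lines of Python. The sheet content is
invented for the demonstration; the point is only that the bytes decode
cleanly in the encoding they were written in and not in the one the
@charset rule names.

```python
import codecs

# A sheet written as UTF-16LE with a BOM, whose @charset claims ISO-8859-1.
css = '@charset "ISO-8859-1"; p { color: red }'
sheet = codecs.BOM_UTF16_LE + css.encode("utf-16-le")

as_named = sheet.decode("iso-8859-1")   # trust the name inside @charset
as_written = sheet.decode("utf-16")     # trust the BOM / actual encoding

# Trusting the name yields two junk characters for the BOM and a NUL
# after every ASCII character; trusting the BOM recovers the sheet.
```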

> Part of the issue here is that an @charset rule seems to have two things
> it does: 1) hint at the charset  2)  hint at how the sheet is to be
> serialized.   And it sounds like we want to handle cases where the first
> type of hint is wrong?

The first case is the one being covered here, yes.

-- 
Ian Hickson                                      )\._.,--....,'``.    fL
U+1047E                                         /,   _.. \   _\  ;`._ ,.
http://index.hixie.ch/                         `._.-(,_..'--(,_..'`-.;.'

Received on Tuesday, 17 February 2004 21:04:26 UTC