- From: Ian Hickson <ian@hixie.ch>
- Date: Wed, 18 Feb 2004 02:44:23 +0000 (UTC)
- To: Boris Zbarsky <bzbarsky@MIT.EDU>
- Cc: Bert Bos <bert@w3.org>, www-style@w3.org
On Tue, 17 Feb 2004, Boris Zbarsky wrote:
>
> So in other words, in the proposed setup one would need to do the
> following to be robust wrt new encodings that may appear:
>
> 1) See whether there is a BOM.
> 2) Parse the @charset rule.
> 3) If there is an @charset rule and a BOM, encode U+FEFF using the
>    charset from the @charset rule and see whether this agrees with the
>    BOM's representation. If it does, use the @charset charset. If
>    not, guess a charset based on the BOM's encoding.

That would be compliant, I think. It should be equivalent to the
following algorithm. Follow the steps given until the set of encodings
declared in step zero has only one remaining encoding. If a step would
reduce the number of encodings to zero, skip that step.

0) Set the set of encodings to include all known encodings.

1) If there is an HTTP Content-Type header, reduce the set of encodings
   to the set of encodings that the Content-Type header covered. (e.g.
   if it said "text/css;charset=utf-16" then the set would be UTF-16LE,
   UTF-16BE.)

2) See if you can detect a BOM. If so, use that to reduce the set of
   encodings to the set of encodings that have that BOM.

3) See if you can detect an '@charset' in one of the encodings in the
   set. If so, see if it says that the encoding is one of the encodings
   in the current set. If so, reduce the set to the encodings that
   match what the @charset rule specified.

4) Do the same again, using any metadata from the linking mechanism,
   such as <link charset="">.

5) If there is still more than one encoding in the set, and you have a
   referring document or stylesheet, and it has a known encoding that
   is one of the encodings in the set, then reduce the set to just that
   encoding.

6) Use a UA-dependent mechanism to narrow the set down to one encoding.

Sound right? I believe this is exactly equivalent to what the spec says
now, but in more detail than we want for 2.1. Maybe CSS3 could use a
more explicit algorithm like the above.
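[The six-step set-narrowing algorithm above can be sketched in code. This is a minimal illustration, not anything from the thread or the CSS spec: the function and parameter names (`pick_encoding`, `http_charset`, `link_charset`, `referrer_charset`) are invented for the example, the set of "all known encodings" is cut down to five for brevity, and step 6's UA-dependent tie-break is stood in for by an arbitrary deterministic pick.]

```python
import codecs

# Stand-in for "all known encodings" (step 0); a real UA would have many more.
KNOWN_ENCODINGS = {"utf-8", "utf-16le", "utf-16be", "iso-8859-1", "windows-1252"}

# Map each BOM to the encodings that have that BOM (step 2).
BOMS = {
    codecs.BOM_UTF8: {"utf-8"},
    codecs.BOM_UTF16_LE: {"utf-16le"},
    codecs.BOM_UTF16_BE: {"utf-16be"},
}

def narrow(candidates, subset):
    # "If a step would reduce the number of encodings to zero, skip that step."
    remaining = candidates & subset
    return remaining if remaining else candidates

def pick_encoding(raw, http_charset=None, link_charset=None, referrer_charset=None):
    # Step 0: start with every known encoding.
    candidates = set(KNOWN_ENCODINGS)

    # Step 1: HTTP Content-Type; "utf-16" covers both byte orders.
    if http_charset:
        covered = {"utf-16le", "utf-16be"} if http_charset == "utf-16" else {http_charset}
        candidates = narrow(candidates, covered)

    # Step 2: BOM detection.
    for bom, encs in BOMS.items():
        if raw.startswith(bom):
            candidates = narrow(candidates, encs)
            break

    # Step 3: try to read an @charset rule in each candidate encoding.
    for enc in sorted(candidates):
        text = raw[:64].decode(enc, errors="ignore")
        if text.startswith('@charset "'):
            end = text.find('"', 10)
            if end != -1:
                declared = text[10:end].lower()
                if declared in candidates:
                    candidates = narrow(candidates, {declared})
                    break

    # Step 4: metadata from the linking mechanism, e.g. <link charset="">.
    if link_charset:
        candidates = narrow(candidates, {link_charset.lower()})

    # Step 5: encoding of the referring document or stylesheet.
    if referrer_charset and referrer_charset in candidates:
        candidates = {referrer_charset}

    # Step 6: UA-dependent narrowing; here just a deterministic arbitrary pick.
    return sorted(candidates)[0]
```

For example, a sheet beginning with the ASCII bytes of `@charset "utf-8";` is narrowed to UTF-8 at step 3, while `charset=utf-16` plus a little-endian BOM resolves to UTF-16LE after steps 1 and 2.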
The one thing that that algorithm doesn't say is how to cope with the
case where, in step 3, you detect an @charset, and the given encoding
is in the set, but the set of encodings that would detect the @charset
and the set of encodings that are covered by the given encoding do not
overlap.

>> As far as I know the only encodings that can represent U+FEFF are:
>
> I should clarify that I am by no means an intl expert. Hence all the
> questions. It's just that I tend to assume that any situation like
> the one we have with encodings will deteriorate (that is, more random
> encodings will appear, possibly overlapping existing ones in most
> undesirable ways).

One would hope, given the existence of Unicode, that we will not be
seeing new encodings any more (except in specialist fields such as
Punycode for IDN, but that doesn't really count).

-- 
Ian Hickson                                      )\._.,--....,'``.    fL
U+1047E                                         /,   _.. \   _\  ;`._ ,.
http://index.hixie.ch/                         `._.-(,_..'--(,_..'`-.;.'
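[The step-3 mismatch described above can be made concrete with a small consistency check, not taken from the thread: re-encode the @charset rule in the encoding it declares and see whether those bytes are actually what the sheet starts with. If not, the declaration could not have been produced by the encoding it names. The function name is invented for illustration.]

```python
def charset_agrees_with_bytes(raw: bytes, declared: str) -> bool:
    """Could the declared encoding itself have produced the @charset
    rule we just read at the start of the stylesheet bytes?"""
    rule = '@charset "%s";' % declared
    try:
        expected = rule.encode(declared)
    except LookupError:  # unknown encoding label
        return False
    return raw.startswith(expected)
```

The classic mismatch: a sheet whose first bytes are the ASCII sequence `@charset "utf-16be";`. The rule is detectable via ASCII-compatible encodings, and UTF-16BE may be in the candidate set, but a genuinely UTF-16BE sheet would interleave NUL bytes, so the two sets never overlap and the check fails.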
Received on Tuesday, 17 February 2004 21:44:25 UTC