Re: [CSS21] BOM & @charset (issues 44 & 115)

From: Boris Zbarsky <bzbarsky@MIT.EDU>
Date: Tue, 17 Feb 2004 20:59:20 -0600
Message-ID: <4032D508.5040700@mit.edu>
To: Ian Hickson <ian@hixie.ch>
Cc: Bert Bos <bert@w3.org>, www-style@w3.org

Ian Hickson wrote:
> That would be compliant, I think. It should be equivalent to the following
> algorithm.

Unfortunately, your algorithm makes some unwarranted assumptions about 
the information the user agent has... and the sort of encodings it can 
handle.  And the two algorithms are not quite equivalent in edge cases 
like people specifying UTF-16 without LE or BE and with no BOM.
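
To make the UTF-16 edge case concrete, here is a minimal sketch (Python, 
chosen only for illustration; the sample bytes are made up).  Without a 
BOM, the same bytes decode cleanly under both byte orders, so the label 
"UTF-16" alone does not pick one:

```python
# Hypothetical stylesheet bytes: "a{}" encoded as UTF-16BE with no BOM.
data = "a{}".encode("utf-16-be")   # b'\x00a\x00{\x00}'

# Both decodes succeed; neither raises, so a decoder error cannot be
# used to tell the two byte orders apart.
as_be = data.decode("utf-16-be")
as_le = data.decode("utf-16-le")

print(repr(as_be))   # the intended text
print(repr(as_le))   # three unrelated CJK-range code points
```

Which of the two a user agent ends up with depends on whatever default 
it (or its conversion library) applies.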

>  0) Set the set of encodings to include all known encodings.

There may be no way to get such a list from the unicode conversion system...

>  1) If there is an HTTP Content-Type header, reduce the set of encodings
>     to the set of encodings that the Content-Type header covered. (e.g.
>     if it said "text/css;charset=utf-16" then the set would be UTF-16LE,
>     UTF-16BE.)

This requires either hardcoding knowledge of that sort in the user-agent 
or assuming that the unicode conversion system has an api for this.  I 
don't believe iconv has such an api, for example.  Do Windows or Mac OSX 
have such apis?  Hardcoding means that if new encodings appear and your 
unicode conversion system adds support for them you're still broken.
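
For illustration, the hardcoding option might look like the sketch below. 
The table contents and the function name are my own assumptions, not 
something any conversion library provides:

```python
# Hypothetical hand-maintained table: charset label -> concrete encodings.
# Any label missing from it is assumed to name exactly one encoding, which
# is where a newly added multi-variant encoding would silently go wrong.
CANDIDATES = {
    "utf-16": {"utf-16le", "utf-16be"},
    "utf-32": {"utf-32le", "utf-32be"},
}

def candidate_encodings(content_type):
    """Return the candidate set for a Content-Type value, or None if it
    carries no charset parameter."""
    for param in content_type.split(";")[1:]:
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset":
            label = value.strip().strip('"').lower()
            return CANDIDATES.get(label, {label})
    return None

print(candidate_encodings("text/css;charset=utf-16"))
```

The fallback to a single-element set is the weak point: the table has to 
be updated by hand every time the conversion system learns a new family 
of encodings.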

>  2) See if you can detect a BOM. If so, use that to reduce the set of
>     encodings to the set of encodings that have that BOM.

This requires either hardcoding knowledge of what a BOM looks like in 
various encodings, having an api to ask for that information from the 
unicode conversion system, or trying to encode it in each encoding...
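
A sketch of the first option, hardcoding the BOM byte sequences, is 
below.  It covers only the UTF-8/16/32 family, using the constants in 
Python's codecs module; the table, its ordering and the function name 
are assumptions for illustration, and any other encoding with a BOM 
would need its own hand-written entry:

```python
import codecs

# Hardcoded BOM table.  The UTF-32 entries come first because the
# UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32le"),
    (codecs.BOM_UTF32_BE, "utf-32be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
]

def sniff_bom(data):
    """Return the encoding whose BOM prefixes data, or None."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding
    return None

print(sniff_bom(b"\xef\xbb\xbfa{}"))   # bytes starting with the UTF-8 BOM
```

Note that FF FE 00 00 is also a valid UTF-16LE BOM followed by U+0000, 
so even this small table has to make a judgment call.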

Given my admittedly limited knowledge of intl issues in today's OSes, 
the algorithm you propose is not feasibly implementable, especially if 
you aim to support a broad range of OSes.  Once again, I would love to 
be proved wrong.

> One would hope, given the existence of Unicode, that we will not be seeing
> new encodings any more (except in specialist fields such as Punycode for
> IDN, but that doesn't really count).

Given Ernest's post about CESU-8, that hope seems highly unwarranted... :(

Received on Tuesday, 17 February 2004 21:59:26 UTC
