Re: [CSS21] response to issue 115 (and 44) from David Woolley on 2004-02-22 (www-style@w3.org from February 2004)

From: David Woolley <david@djwhome.demon.co.uk>
Date: Sun, 22 Feb 2004 09:09:43 +0000 (GMT)
To: www-style@w3.org
Message-Id: <200402220909.i1M99hW00276@djwhome.demon.co.uk>

> dependent, but I wouldn't mind it trying.  Trying to auto detect when
> the result is valid ISO-8859-1 (or whatever the default document
> character encoding is for that type of document strikes me as
> arrogant, especially since I can't imagine why anyone would want

It is necessary if autodetect is to be of use in the real world,
although, as noted before, the lack of maintenance of this feature in
IE suggests it is not widely used.  In any case, general autodetect
is normally an option that the user has to enable.

Both big5 and gb2312 use only byte values that are also valid in
iso-8859-1.  Whilst you could tell Windows 1252 from iso-8859-1 when the
data contains bytes outside the iso-8859-1 subset (and the distinction
is irrelevant otherwise), you can't tell gb2312 from iso-8859-1 without
looking at the statistics of the data (probably simple frequencies and
digram frequencies[1]).  As far as I know, all auto-detect features use
such statistics, and they will get things wrong for pathological cases.
(Some do require extra hints.)
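
To make the point concrete, here is a rough Python sketch of a byte-level
heuristic.  It is not how IE or any other browser does it; the function
names, the 0.9 threshold and the "even-length runs of high bytes" statistic
are all my own assumptions, chosen as about the crudest digram-style measure
that separates EUC-encoded gb2312 from iso-8859-1.

```python
import re

# Runs of bytes in the range EUC-encoded gb2312 uses for double-byte characters.
HIGH_RUN = re.compile(rb"[\xa1-\xfe]+")


def has_c1_bytes(data: bytes) -> bool:
    """True if any byte is in 0x80-0x9F: printable in Windows 1252,
    but only C1 control codes in iso-8859-1."""
    return any(0x80 <= b <= 0x9F for b in data)


def even_run_ratio(data: bytes) -> float:
    """Fraction of high-byte runs with even length (hypothetical statistic).
    gb2312 characters are two high bytes each, so its runs are always even;
    accented letters in Western text mostly occur singly."""
    runs = HIGH_RUN.findall(data)
    if not runs:
        return 0.0
    return sum(len(r) % 2 == 0 for r in runs) / len(runs)


def guess(data: bytes, threshold: float = 0.9) -> str:
    """Guess a charset from byte statistics alone; the threshold is arbitrary."""
    if not any(b >= 0x80 for b in data):
        return "ascii"
    if has_c1_bytes(data):
        return "windows-1252"
    if even_run_ratio(data) >= threshold:
        return "gb2312"
    return "iso-8859-1"


chinese = "\u4e2d\u6587\u6d4b\u8bd5".encode("gb2312")
# Validity alone tells you nothing: every byte string decodes as iso-8859-1.
as_latin1 = chinese.decode("iso-8859-1")
```

With this sketch, guess(chinese) gives "gb2312" and ordinary French text
encoded as iso-8859-1 gives "iso-8859-1", but the French word "créée"
on its own is guessed as "gb2312", because its two adjacent accented
letters look like one double-byte character.  That is the kind of
pathological case I mean.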

Again, as noted before, most users are only interested in one
character set, plus possibly ISO 646/INV (i.e. ASCII), which is
a subset or unshifted variant of most others used in practice on
the web, so the simpler algorithm of using a fixed, but selectable,
character set, when none is specified, normally works for them.
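
As a sketch of that simpler algorithm (the function and parameter names
here are hypothetical, not taken from any browser), the whole thing
amounts to: use the declared character set if there is one, otherwise
the user's configured default.

```python
from typing import Optional


def decode_document(data: bytes,
                    declared: Optional[str] = None,
                    user_default: str = "iso-8859-1") -> str:
    """Decode with the declared charset, else the user's selected default.
    No statistics are consulted; undecodable bytes become U+FFFD."""
    charset = declared or user_default
    return data.decode(charset, errors="replace")


# A user who reads mostly Chinese sets the default once; pure ASCII
# documents still decode correctly because ASCII is a subset of gb2312.
page = decode_document("\u4e2d\u6587".encode("gb2312"), user_default="gb2312")
ascii_page = decode_document(b"hello", user_default="gb2312")
```

The default only matters when nothing is declared, so a document that
does label itself is unaffected by the user's choice.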

[1] computed over the bytes, rather than over the characters.

Received on Sunday, 22 February 2004 04:09:47 UTC