Re: Accept-Charset support from Chris Lilley on 1996-12-06 (www-international@w3.org from October to December 1996)

From: Chris Lilley <Chris.Lilley@sophia.inria.fr>
Date: Fri, 6 Dec 1996 10:59:59 +0100 (MET)
To: Chris Wendt <christw@MICROSOFT.com>, "'garym@softshore.com.au '" <garym@softshore.com.au>, "'www-international@w3.org '" <www-international@w3.org>
Cc: "'Alan Barrett/DUB/Lotus'" <Alan_Barrett/DUB/Lotus.LOTUSINT@crd.lotus.com>
Message-Id: <9612061059.ZM23822@grommit.inria.fr>

On Dec 5,  4:53pm, Chris Wendt wrote:

> A "good" browser would need to send all or most of the charsets listed
> in
> ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets
>
> Easy to take all the names, concatenate them and measure the size.

Right.

bash$ grep Name character-sets.txt > chara.log
bash$ cat chara.log | awk '{print $2}' | grep -v Name > chara2.log
bash$ wc chara2.log
       210       210    2611 chara2.log
(i.e. 210 lines, 210 words (210 charset names) and 2611 bytes)

To which should be added 209 commas and 209 spaces, while taking away
209 linefeeds; that gives 2820 bytes.
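
As a rough cross-check of that arithmetic, here is a minimal Python sketch.
The helper names `accept_charset_value` and `estimated_length` are
hypothetical, and the three names are sample data taken from this thread,
not the full IANA list. It assumes one name per line in the file that `wc`
measured and a comma plus a space between names in the header value.

```python
def accept_charset_value(names):
    """Join charset names with a comma and a space between each pair."""
    return ", ".join(names)


def estimated_length(wc_bytes, count):
    """Length of the joined value, derived from `wc` output.

    wc_bytes includes one linefeed per name; drop those, then add a comma
    and a space between each adjacent pair of names.
    """
    return wc_bytes - count + 2 * (count - 1)


# Sample data only: three names mentioned in this thread.
sample = ["ANSI_X3.4-1968", "macintosh", "UNKNOWN-8BIT"]
value = accept_charset_value(sample)

# The byte count `wc` would report for a file with one name per line.
wc_bytes = sum(len(name) + 1 for name in sample)
assert len(value) == estimated_length(wc_bytes, len(sample))

# Figures from the wc output above: 210 names, 2611 bytes.
print(estimated_length(2611, 210))  # 2819; 2820 if one trailing linefeed is kept
```

The 2820 figure corresponds to keeping the final linefeed; the joined
value on its own comes to 2819 bytes.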

> I didn't do it but instead looked at the size of the IANA document
> itself which is 44282 bytes.
> Assuming the text contains 2/3 overhead, the accept-charset string would
> be 14 KB, attached to every GET.

So, that turns out to be an over-estimate - provided only the preferred
registered names are used, not the synonyms. However, the preferred name
is not always the familiar one: ANSI_X3.4-1968, for example, is more
commonly referred to as US-ASCII.

The actual list would be shorter because few browsers would want to
say that they can take Ventura-US, EBCDIC-UK or UNKNOWN-8BIT. Also,
some of the names are pretty long -
Extended_UNIX_Code_Packed_Format_for_Japanese
is a particularly bloated example.

There are also several national variants of 646-IRV whose character
repertoire is included in Latin-1 so there is little reason to use
those. Some of the reference sources - VAX/VMS User's Manual, LaserJet
IIP Printer User's Manual, PCL 5  - make it unlikely that these charsets
are used on the Web.

There are some possible duplicates or outdated charsets, for example

Name: macintosh                                           [RFC1345,KXS2]
MIBenum: 2027
Source: The Unicode Standard ver1.0, ISBN 0-201-56788-1, Oct 1991
Alias: mac
Alias: csMacintosh

is presumably contained within unicode-1-1, and ISO-10646-UTF-1 is
obsolete.
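
To separate preferred names from synonyms in records like the one quoted
above, something along these lines might do. It is only a sketch:
`parse_registry` is a hypothetical helper, the record layout is assumed
from that single excerpt, and the treatment of "None" as an empty alias
placeholder is an assumption about the rest of the file.

```python
def parse_registry(text):
    """Map each preferred name to its list of aliases.

    Assumes records look like the excerpt above: a 'Name:' line followed by
    optional 'Alias:' lines. Anything after the first token on a line
    (e.g. '[RFC1345,KXS2]') is ignored, as are all other fields.
    """
    registry = {}
    current = None
    for line in text.splitlines():
        field, _, rest = line.partition(":")
        tokens = rest.split()
        if not tokens:
            continue  # blank line, or a line with no value after the colon
        if field == "Name":
            current = tokens[0]
            registry[current] = []
        elif field == "Alias" and current is not None and tokens[0] != "None":
            # "None" is assumed to be a placeholder meaning "no alias".
            registry[current].append(tokens[0])
    return registry


# Sample data: the single record quoted above.
SAMPLE = """\
Name: macintosh                                           [RFC1345,KXS2]
MIBenum: 2027
Source: The Unicode Standard ver1.0, ISBN 0-201-56788-1, Oct 1991
Alias: mac
Alias: csMacintosh
"""

registry = parse_registry(SAMPLE)
print(registry)  # {'macintosh': ['mac', 'csMacintosh']}
```

Joining only the keys of that mapping gives the preferred-names-only list
that the size estimate assumes.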

There are 40 variants of IBM EBCDIC listed.

It looks like a survey of

a) what is actually used on the Web now
b) what gaps there are (the nine Indian charsets, for example, are
   reasonably compact alternatives to UTF-8)
c) what should be the recommended (level 1 and level 2) charsets

would be valuable. How about we collaboratively put together such a list?

-- 
Chris Lilley, W3C                          [ http://www.w3.org/ ]
Graphics and Fonts Guy            The World Wide Web Consortium
http://www.w3.org/people/chris/              INRIA,  Projet W3C
chris@w3.org                       2004 Rt des Lucioles / BP 93
+33 (0)4 93 65 79 87       06902 Sophia Antipolis Cedex, France

Received on Friday, 6 December 1996 05:00:05 UTC