- From: Chris Lilley <Chris.Lilley@sophia.inria.fr>
- Date: Fri, 6 Dec 1996 10:59:59 +0100 (MET)
- To: Chris Wendt <christw@MICROSOFT.com>, "'garym@softshore.com.au '" <garym@softshore.com.au>, "'www-international@w3.org '" <www-international@w3.org>
- Cc: "'Alan Barrett/DUB/Lotus'" <Alan_Barrett/DUB/Lotus.LOTUSINT@crd.lotus.com>
On Dec 5, 4:53pm, Chris Wendt wrote:

> A "good" browser would need to send all or most of the charsets listed
> in
> ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets
>
> Easy to take all the names, concatenate them and measure the size.

Right.

bash$ grep Name character-sets.txt > chara.log
bash$ cat chara.log | awk '{print $2}' | grep -v Name > chara2.log
bash$ wc chara2.log
     210     210    2611 chara2.log

(i.e. 210 lines, 210 words (210 charset names) and 2611 bytes.)

Adding 209 commas and 209 spaces, and taking away 209 linefeeds, gives 2820 bytes.

> I didn't do it but instead looked at the size of the IANA document
> itself which is 44282 bytes.
> Assuming the text contains 2/3 overhead, the accept-charset string would
> be 14 KB, attached to every GET.

So, that turns out to be an over-estimate - provided only the preferred registered names are used, not the synonyms. However, the preferred name is not always the common one: for example, ANSI_X3.4-1968 is more commonly referred to as US-ASCII.

The actual list would be shorter because few browsers would want to say that they can take Ventura-US, EBCDIC-UK or UNKNOWN-8BIT. Also, some of the names are pretty long - Extended_UNIX_Code_Packed_Format_for_Japanese is a particularly bloated example.

There are also several national variants of 646-IRV whose character repertoire is included in Latin-1, so there is little reason to use those.

Some of the reference sources - VAX/VMS User's Manual, LaserJet IIP Printer User's Manual, PCL 5 - make it unlikely that these charsets are used on the Web.

There are some possible duplicates or outdated charsets, for example

Name: macintosh                                    [RFC1345,KXS2]
MIBenum: 2027
Source: The Unicode Standard ver1.0, ISBN 0-201-56788-1, Oct 1991
Alias: mac
Alias: csMacintosh

is presumably contained within unicode-1-1, and ISO-10646-UTF-1 is obsolete. There are 40 variants of IBM EBCDIC listed.
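As a rough check on the size arithmetic earlier in this message, here is a minimal Python sketch. The function names and the three-name sample list are illustrative assumptions, not part of the original thread; the figures 2611 and 210 are the `wc` output quoted above.

```python
def accept_charset_value(names):
    """Join charset names with a comma and a space, as in the estimate above."""
    return ", ".join(names)


def joined_size_from_wc(byte_count, line_count):
    """Size in bytes of the joined value, given `wc` byte and line counts
    for a file holding one name per line (every line ends in a linefeed)."""
    name_bytes = byte_count - line_count   # drop every linefeed
    separators = 2 * (line_count - 1)      # ", " between adjacent names
    return name_bytes + separators


if __name__ == "__main__":
    # Hypothetical subset, only to show the shape of the header value.
    sample = ["US-ASCII", "ISO-8859-1", "UTF-8"]
    value = accept_charset_value(sample)
    print(value, len(value.encode("ascii")))

    # wc figures from the message: 2611 bytes, 210 lines.
    print(joined_size_from_wc(2611, 210))
```

With the figures from the message this gives 2819 bytes, one less than the 2820 quoted, because the arithmetic in the message removes 209 of the 210 linefeeds and so keeps the final one. Either way the value is far below the 14 KB estimate.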
It looks like a survey of

a) what is actually used on the Web now
b) what gaps there are (the nine Indian charsets, for example, are reasonable compact alternatives to UTF-8)
c) what should be the recommended (level 1 and level 2) charsets

would be valuable. How about we collaboratively put together such a list?

--
Chris Lilley, W3C [ http://www.w3.org/ ]
Graphics and Fonts Guy            The World Wide Web Consortium
http://www.w3.org/people/chris/   INRIA, Projet W3C
chris@w3.org                      2004 Rt des Lucioles / BP 93
+33 (0)4 93 65 79 87              06902 Sophia Antipolis Cedex, France
Received on Friday, 6 December 1996 05:00:05 UTC