
Unicode <-> CJKV national encoding; supporting multi-lingual web content

From: Michael Gorelik <mgorelik@Novarra.com>
Date: Tue, 14 Aug 2001 15:35:15 -0500
Message-ID: <3956B7121A30D411850800508B9A5EE035F375@novarrainet1.internalnt.novarra.com>
To: www-international@w3.org
We work on a product that enables wireless devices to access any web
content on the fly. I am in the process of evaluating options to enable
multi-language support, and honestly I am ready to jump out of the window :-)

I have several questions; hopefully someone can help me sort them out :-)

1) Has anyone seen any information on the number of pages, or the % of
content, available in different charsets, such as ISO-8859-1, UTF-8, UTF-16,
EUC-JP, ISO-2022-JP, Shift_JIS, etc. (other than the Babel study)? I am
trying to get an idea of the number of users of each particular charset.

2) Also, if someone can point me to a nice table that lists languages,
character repertoires, coded character sets, and charsets, I would be very
grateful. Something like this:
Language    Character Repertoire    Coded Character Set
English     ISO8859-1               ISO8859-1
Japanese    JIS X 0208-1990         shift jis
Of course, I am still at a loss as to which standard defines the character
repertoire, which defines the coded character set, which defines the
encoding, and which defines the charset.
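
A minimal sketch of the distinction being asked about, assuming Python 3 and
its built-in codecs (the sample character and the helper name encoded_forms
are illustrative assumptions, not something from any of the standards
mentioned). It shows one character with a single Unicode code point (its
position in a coded character set) and several different byte sequences,
one per charset/encoding:

```python
# One abstract character, taken here as an arbitrary example.
ch = "日"

# Its number in the Unicode coded character set (the code point).
code_point = ord(ch)


def encoded_forms(text, charsets=("utf-8", "utf-16-be", "shift_jis",
                                  "euc_jp", "iso2022_jp")):
    """Return the bytes (as hex) that each charset uses for the same text."""
    return {name: text.encode(name).hex() for name in charsets}


forms = encoded_forms(ch)

print(f"code point: U+{code_point:04X}")
for name, hex_bytes in forms.items():
    print(f"{name:10s} {hex_bytes}")
```

The code point stays the same in every row; only the byte serialization
changes with the charset label.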

3) Probably my most important dilemma is: can we use Unicode to represent
data internally? Namely, are there mapping tables from all of the most
widely used charsets in Europe and East Asia into Unicode and back? If there
are widely used encodings that don't map into Unicode nicely, what are they?
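
A rough sketch of how the "into Unicode and back" question could be probed,
assuming Python 3 and the mapping tables shipped with its standard codecs.
The helper names round_trips and can_encode and the sample strings are
hypothetical, chosen only for illustration:

```python
def round_trips(raw: bytes, charset: str) -> bool:
    """Decode legacy bytes to Unicode, re-encode, and compare with the input."""
    text = raw.decode(charset)           # legacy charset -> Unicode
    return text.encode(charset) == raw   # Unicode -> legacy charset


def can_encode(text: str, charset: str) -> bool:
    """Report whether every character of text exists in the target charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# Small illustrative samples; real content would come from fetched pages.
samples = {
    "iso8859_1": "café".encode("iso8859_1"),
    "shift_jis": "日本語".encode("shift_jis"),
    "euc_jp": "日本語".encode("euc_jp"),
    "euc_kr": "한국어".encode("euc_kr"),
    "big5": "中文".encode("big5"),
    "gb2312": "中文".encode("gb2312"),
}

results = {name: round_trips(raw, name) for name, raw in samples.items()}
for name, ok in results.items():
    print(f"{name:10s} round trip ok: {ok}")

# Going the other way can fail: Japanese text has no ISO-8859-1 form.
print(can_encode("日本語", "iso8859_1"))
```

A clean round trip on a few samples does not show that every byte sequence
in a charset maps cleanly; it only shows the shape of the check.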

4) What is the set of IANA charsets for CJKV that I need to be able to
handle in my product to support, let's say, 80-90% of the content available in

Thanx in advance:-)
Misha Gorelik
Received on Tuesday, 14 August 2001 16:40:07 UTC
