Unicode <-> CJKV national encoding; supporting multi-lingual web content

We work on a product that enables wireless devices to access any web
content on the fly. I am in the process of evaluating options to enable
multi-language support, and honestly I am ready to jump out of the window :-)

I have several questions; hopefully someone can help me sort them out :-)

1) Has anyone seen any information on the number of pages, or the percentage
of content, available in different charsets, such as ISO-8859-1, UTF-8,
UTF-16, EUC-JP, ISO-2022-JP, Shift_JIS, etc. (other than the Babel study)?
I am trying to get an idea of the number of users of each charset.


2) Also, if someone can point me to a nice table that lists languages,
character repertoires, coded character sets, and charsets, I would be very
grateful. Something like this:
Language   Character Repertoire   Coded Character Set   charset
English    ISO8859-1              ISO8859-1             ISO8859-1
Japanese   JIS X 0208-1990        shift jis             shift-jis
                                  iso-2022-jp           iso-2022-jp
etc.
Of course, I am still at a loss as to which standard defines the character
repertoire, which defines the coded character set, which defines the
encoding, and which defines the charset.

3) Probably my most important dilemma is: can we use Unicode to represent
data internally? Namely, are there mapping tables from all of the most
widely used charsets in Europe and East Asia into Unicode and back? If there
are widely used encodings that don't map into Unicode cleanly, what are they?
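
To make this question concrete, the round trip I have in mind could be
sketched roughly as follows. This is only an illustration in Python, on the
assumption that its built-in codec names (shift_jis, euc_jp, iso2022_jp,
latin-1) can stand in for the mapping tables I am asking about. The helper
round_trips is a hypothetical name of my own, not part of any product.

```python
def round_trips(data: bytes, charset: str) -> bool:
    """Return True if data survives charset -> Unicode -> charset unchanged."""
    text = data.decode(charset)           # national encoding -> Unicode
    return text.encode(charset) == data   # Unicode -> national encoding


# Three kanji (U+65E5, U+672C, U+8A9E), all present in JIS X 0208.
sample = "\u65e5\u672c\u8a9e"

for name in ("shift_jis", "euc_jp", "iso2022_jp"):
    encoded = sample.encode(name)
    print(name, round_trips(encoded, name))
```

A check like this only shows that one sample converts losslessly; it says
nothing about the characters in a charset that have no agreed mapping, which
is exactly the part I am unsure about.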

4) What is the set of IANA charsets for CJKV that I need to be able to
handle in my product to support, let's say, 80-90% of the content available
in Asia?

Thanx in advance:-)
Misha Gorelik
*;O)

Received on Tuesday, 14 August 2001 16:40:07 UTC