Re: Unicode <-> CJKV national encoding; supporting multi-lingual webcontent from A. Vine on 2001-08-14 (www-international@w3.org from July to September 2001)

From: A. Vine <avine@eng.sun.com>
Date: Tue, 14 Aug 2001 14:54:41 -0700
To: Michael Gorelik <mgorelik@Novarra.com>
Cc: www-international@w3.org
Message-id: <3B799E21.967D77DD@eng.sun.com>

Misha,
Comments embedded.

Michael Gorelik wrote:
> 
> We work on a product that enables wireless devices to access any web
> content on the fly. I am in the process of evaluating options to enable
> multi-language support, and honestly I am ready to jump out of the window:-)

Don't jump.   Just change jobs.

> 
> I have several questions; hopefully someone can help me sort them out:-)
> 
> 1) Has anyone seen any information on the number of pages, or % of content,
> available in different charsets, such as ISO8859-1, UTF-8, UTF-16, EUC-JP,
> ISO-2022-JP, Shift-JIS, etc. (apart from the Babel study)? I am trying to get
> an idea of the number of users of each particular charset.

I can't help you there, maybe someone else can.

> 
> 2) Also, if someone can point out a nice table that lists languages,
> character repertoire, coded character set, charset, I would be very
> grateful. Something like this:
> Language   Character Repertoire   Coded Character Set   charset
> English    ISO8859-1              ISO8859-1             ISO8859-1
> Japanese   JIS X 0208-1990        shift jis             shift-jis
>                                   iso-2022-jp           iso-2022-jp

OK, so you're asking for a table, but just so you know, Shift_JIS isn't a coded
character set (CCS); it's a character encoding scheme (CES) and a charset, and
it's written with an underscore.   Also, since you're dealing in MIME names, the
charset name for ISO-8859-1 is just that: ISO-8859-1, hyphens included.
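
To make the naming point concrete, here is a minimal sketch, assuming Python 3
and its built-in codecs module (my choice for illustration, not something from
your product).  The sample labels and the resolve() helper are hypothetical.
Note that Python is more forgiving about spelling than a strict MIME parser
would be, so this only shows which labels it recognizes.

```python
import codecs

# Labels as they might appear in a Content-Type header (hypothetical samples).
labels = ["Shift_JIS", "shift_jis", "ISO-8859-1", "EUC-JP", "ISO-2022-JP"]


def resolve(label):
    """Return Python's canonical codec name for a charset label, or None."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None


for label in labels:
    print(f"{label!r:15} -> {resolve(label)}")
```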

> etc.
> Of course I am still at a loss which standard defines character repertoire,

Character repertoires are usually defined by nat'l or int'l standards bodies
and/or in the context of a coded character set.

> which defines, coded character set, 

Coded character sets are usually defined by nat'l or int'l standards bodies.

> which one defines encoding, and which
> one defines charset.

Character encoding schemes come out of standards bodies or implementers.
Charsets are registered names of character maps, essentially the name of a
particular implementation of a CES.
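
If the CCS/CES distinction seems abstract, this rough sketch may help.  It
assumes Python 3 and its built-in codecs; the sample string is my own.  The
same three JIS X 0208 characters come out as different byte sequences under
three different encoding schemes, yet each decodes back to the same text.

```python
text = "日本語"  # three kanji that are all in JIS X 0208

# One coded character set, three character encoding schemes / charsets.
encoded = {name: text.encode(name)
           for name in ("shift_jis", "euc_jp", "iso2022_jp")}

for name, raw in encoded.items():
    # Show the bytes in hex and confirm they decode back to the same text.
    print(f"{name:12} {raw.hex()}  round-trips: {raw.decode(name) == text}")
```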

> 
> 3) Probably my most important dilemma is: can we use Unicode to represent
> data internally? Namely, are there mapping tables from all the most widely
> used charsets in Europe and East Asia into Unicode and back?

Yes.
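
As a sketch of what "Unicode internally" can look like, assuming Python 3 and
its bundled mapping tables (your platform will have its own converters): decode
incoming bytes to Unicode, work on that, and encode again on the way out.  The
transcode() helper and the sample strings are hypothetical.

```python
def transcode(raw: bytes, src: str, dst: str) -> bytes:
    """Decode raw from src into a Unicode str, then re-encode it as dst."""
    text = raw.decode(src)    # legacy charset -> Unicode (internal form)
    return text.encode(dst)   # Unicode -> legacy charset


sjis = "日本語".encode("shift_jis")
euc = transcode(sjis, "shift_jis", "euc_jp")
back = transcode(euc, "euc_jp", "shift_jis")
print("round trip preserved bytes:", back == sjis)

# Going back out only works if the target charset has the character.
try:
    "한국어".encode("shift_jis")   # Hangul is not in JIS X 0208
except UnicodeEncodeError as exc:
    print("not representable:", exc.reason)
```

The direction to watch is Unicode -> legacy: the strict default raises an error
for characters the target charset lacks, so the output side needs a fallback
policy.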

> If there are
> widely used encodings that don't map into Unicode nicely, what are they?

Er, GB18030 (?)  It's so new that I haven't seen a mapping scheme yet.  Of
course, it's not widely used yet either, though it is a requirement.

> 
> 4) What is the set of IANA charsets for CJKV that I need to be able to
> handle in my product to, let's say, support 80-90% of the content available
> in Asia?

Not in any order:

Shift_JIS
EUC-JP
GB2312 (sometimes called EUC-CN)
GBK
EUC-KR
CNS11643 (sometimes called EUC-TW)
Big5

Most of the above are listed by their preferred MIME names - see the IANA
registry at:
     http://www.iana.org/assignments/character-sets
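
One quick way to see how much of that list a given platform covers is to probe
its converter library.  A minimal sketch, assuming Python 3's codecs module;
the is_supported() helper is hypothetical, and "EUC-TW" is only my guess at a
label for the CNS11643 entry, which may well be missing from a stock install.

```python
import codecs

# The charsets listed above; "EUC-TW" stands in for the CNS11643 entry.
wanted = ["Shift_JIS", "EUC-JP", "GB2312", "GBK", "EUC-KR", "EUC-TW", "Big5"]


def is_supported(label: str) -> bool:
    """True if this Python build has a codec registered for the label."""
    try:
        codecs.lookup(label)
        return True
    except LookupError:
        return False


supported = {label: is_supported(label) for label in wanted}
for label, ok in supported.items():
    print(f"{label:10} {'available' if ok else 'needs a third-party converter'}")
```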

> 
> Thanx in advance:-)
> Misha Gorelik
> *;O)

nezashto,
Andrea

Received on Tuesday, 14 August 2001 17:55:56 UTC