- From: A. Vine <avine@eng.sun.com>
- Date: Tue, 14 Aug 2001 14:54:41 -0700
- To: Michael Gorelik <mgorelik@Novarra.com>
- Cc: www-international@w3.org
Misha,
Comments embedded below.
Michael Gorelik wrote:
>
> We work on a product that enables wireless devices to access any web
> content on the fly. I am in the process of evaluating options to enable
> multi-language support, and honestly I am ready to jump out of the window:-)
Don't jump. Just change jobs.
>
> I have several questions; hopefully someone can help me sort them out:-)
>
> 1) Has anyone seen any information on the number of pages, or the % of
> content, available in different charsets, such as ISO8859-1, UTF-8, UTF-16,
> EUC-JP, ISO-2022-JP, ShiftJIS, etc. (other than the Babel study)? I am trying
> to get an idea of the number of users of each particular charset.
I can't help you there; maybe someone else can.
>
> 2) Also, if someone can point me to a nice table that lists languages,
> character repertoires, coded character sets, and charsets, I would be very
> grateful. Something like this:
> Language   Character Repertoire   Coded Character Set   charset
> English    ISO8859-1              ISO8859-1             ISO8859-1
> Japanese   JIS X 0208-1990        shift jis             shift-jis
>                                   iso-2022-jp           iso-2022-jp
OK, so you're asking for a table, but just so you know: Shift_JIS isn't a coded
character set (CCS); it's a character encoding scheme (CES) and a charset, and
the name is written with an underscore. Also, since you're dealing in MIME
names, the charset name for ISO-8859-1 is exactly that, hyphens included.
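Charset labels found in real content often differ from the registered name only in case or punctuation, so it can help to normalize them before comparing. The following is a minimal sketch, assuming Python's standard `codecs` registry as the lookup table. The helper name `same_charset` is made up for illustration, and Python's internal codec names are not the IANA/MIME names.

```python
import codecs

def same_charset(label_a, label_b):
    """Return True if both labels resolve to the same Python codec.

    Raises LookupError if either label is unknown to Python.
    """
    return codecs.lookup(label_a).name == codecs.lookup(label_b).name

if __name__ == "__main__":
    print(same_charset("shift-jis", "Shift_JIS"))    # sloppy vs. registered spelling
    print(same_charset("ISO8859-1", "ISO-8859-1"))   # missing hyphen vs. MIME name
    print(same_charset("Shift_JIS", "EUC-JP"))       # genuinely different charsets
```

This only makes input handling tolerant; when emitting a charset name (for example in a Content-Type header), the registered spelling should still be used.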
> etc.
> Of course, I am still at a loss as to which standard defines the character repertoire,
Character repertoires are usually defined by nat'l or int'l standards bodies
and/or in the context of a coded character set.
> which one defines the coded character set,
Coded character sets are usually defined by nat'l or int'l standards bodies.
> which one defines the encoding, and which
> one defines the charset.
Character encoding schemes come out of standards bodies or implementers.
Charsets are registered names of character maps, essentially the name of a
particular implementation of a CES.
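To make the CCS/CES distinction concrete, here is a small sketch that takes one character from the JIS X 0208 repertoire and serializes it with three different charsets. It assumes Python's built-in `shift_jis`, `euc_jp`, and `iso2022_jp` codecs; the function name `encodings_of` is hypothetical.

```python
# One character, assigned a single code in the JIS X 0208 coded character
# set, comes out as different byte sequences under different encoding schemes.
CHAR = "\u65e5"  # the kanji for "day/sun"

def encodings_of(ch, charsets=("shift_jis", "euc_jp", "iso2022_jp")):
    """Return {charset name: hex string of the encoded bytes} for ch."""
    return {name: ch.encode(name).hex() for name in charsets}

if __name__ == "__main__":
    for name, hexbytes in encodings_of(CHAR).items():
        print(f"{name:12} {hexbytes}")
```

The EUC-JP bytes are the JIS X 0208 code with the high bit set on each byte, while ISO-2022-JP carries the same two 7-bit bytes wrapped in escape sequences. Shift_JIS applies its own arithmetic transformation to that code.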
>
> 3) Probably my most important dilemma is: can we use Unicode to represent
> data internally? Namely, are there mapping tables from all of the most widely
> used charsets in Europe and East Asia to Unicode and back?
Yes.
> If there are
> widely used encodings that don't map into Unicode nicely, what are they?
Er, GB18030(?). It's so new that I haven't seen a mapping table for it yet. Of
course, it's not widely used yet either, though supporting it is a requirement
for software sold in China.
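The "Unicode internally" approach can be sketched as decode on input, work on Unicode strings, encode on output. The example below assumes Python's built-in codecs as the mapping tables. The sample strings and the helper names `round_trips` and `transcode` are illustrative only, and a passing check on a few samples says nothing about every code point in a charset.

```python
# Sample text per charset; the samples are illustrative, not exhaustive.
SAMPLES = {
    "iso8859_1": "café",
    "shift_jis": "日本語",
    "euc_jp": "日本語",
    "gb2312": "中文",
    "big5": "中文",
    "euc_kr": "한국어",
}

def round_trips(data, charset):
    """True if legacy bytes -> Unicode str -> legacy bytes is lossless."""
    return data.decode(charset).encode(charset) == data

def transcode(data, src, dst):
    """Convert bytes from one legacy charset to another via Unicode."""
    return data.decode(src).encode(dst)

if __name__ == "__main__":
    for charset, text in SAMPLES.items():
        legacy = text.encode(charset)
        print(charset, round_trips(legacy, charset))
```

With Unicode as the pivot, converting between any two supported charsets needs only one table per charset rather than one per pair.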
>
> 4) What is the set of IANA charsets for CJKV that I need to be able to
> handle in my product to support, let's say, 80-90% of the content available
> in Asia?
Not in any order:
Shift_JIS
EUC-JP
GB2312 (sometimes called EUC-CN)
GBK
EUC-KR
CNS11643 (sometimes called EUC-TW)
Big5
Most of the above are listed by their preferred MIME names - see the IANA
registry at:
http://www.iana.org/assignments/character-sets
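As a quick way to see which of the charsets listed above a given runtime can actually handle, here is a rough sketch that probes Python's codec registry. The helper name `available` is hypothetical, and "EUC-TW" is used here as a stand-in label for CNS11643. Which names resolve depends on the Python build, so a missing codec (EUC-TW may well be one in a stock install, but that is an assumption) is reported rather than treated as an error.

```python
import codecs

# Names taken from the list above; "EUC-TW" stands in for CNS11643.
WANTED = ["Shift_JIS", "EUC-JP", "GB2312", "GBK", "EUC-KR", "EUC-TW", "Big5"]

def available(names):
    """Map each charset name to True/False depending on codec availability."""
    result = {}
    for name in names:
        try:
            codecs.lookup(name)
            result[name] = True
        except LookupError:
            # No codec registered under this name in this runtime.
            result[name] = False
    return result

if __name__ == "__main__":
    for name, ok in available(WANTED).items():
        print(f"{name:10} {'ok' if ok else 'missing'}")
```

Any charset reported as missing would need a third-party codec or a mapping table of your own.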
>
> Thanx in advance:-)
> Misha Gorelik
> *;O)
Nezashto (you're welcome),
Andrea
Received on Tuesday, 14 August 2001 17:55:56 UTC