- From: A. Vine <avine@eng.sun.com>
- Date: Tue, 14 Aug 2001 14:54:41 -0700
- To: Michael Gorelik <mgorelik@Novarra.com>
- Cc: www-international@w3.org
Misha,

Comments embedded.

Michael Gorelik wrote:
>
> We work on the product that enables wireless devices to access any web
> content on the fly. I am in process of evaluating options to enable
> multi-language support, and honestly I am ready to jump out of the window:-)

Don't jump. Just change jobs.

>
> I have several questions, hopefully some one can help me to sort them out:-)
>
> 1)Have any one seeing some information on the amount of pages, % of content
> available in different charsets, such as ISO8859-1, UTF-8, UTF-16, EUC-JP,
> ISO-2022-JP, ShiftJs,etc (except the Babel study). I am trying to get idea
> on the number of users of the particular charset.

I can't help you there; maybe someone else can.

>
> 2)Also, if some one can point out a nice table that list languages,
> character repertoire, coded character set, charset, I would be very
> grateful. Something like this:
>
> Language   Character Repertoire   Coded Character Set   charset
> English    ISO8859-1              ISO8859-1             ISO8859-1
> Japanese   JIS X 0208-1990        shift jis             shift-jis
>                                   iso-2022-jp           iso-2022-jp

OK, so you're asking for a table, but just so you know, Shift_JIS isn't a
coded character set (CCS); it's a character encoding scheme (CES) and a
charset, and it is written with an underscore. Also, since you're dealing in
MIME names, the charset name for ISO-8859-1 is exactly that, with both
hyphens (not ISO8859-1).

> etc.
> Of course I am still at a loss which standard defines character repertoire,

Character repertoires are usually defined by national or international
standards bodies and/or in the context of a coded character set.

> which defines, coded character set,

Coded character sets are usually defined by national or international
standards bodies.

> which one defines encoding, and which
> one defines charset.

Character encoding schemes come out of standards bodies or implementers.
Charsets are registered names of character maps, essentially the name of a
particular implementation of a CES.
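The CCS / CES / charset distinction is easier to see with bytes in hand. What follows is a minimal sketch, assuming Python 3 and the codecs bundled with it; the helper name encode_all is made up for illustration. It takes one character from the JIS X 0208 repertoire and serializes it under three charsets, each naming a different encoding scheme for the same character.

```python
def encode_all(text, charsets=("Shift_JIS", "EUC-JP", "ISO-2022-JP")):
    """Encode `text` under each charset and return {charset name: bytes}.

    The names are MIME charset names; Python's codec lookup is assumed to
    accept them as aliases for its own codec names.
    """
    return {name: text.encode(name) for name in charsets}


# HIRAGANA LETTER A, a character in the JIS X 0208 repertoire.
encodings = encode_all("\u3042")

for name, raw in encodings.items():
    # Print each byte sequence in hex so the differences are visible.
    print(name, " ".join(f"{b:02x}" for b in raw))
```

Same character and same underlying coded character set, yet three different byte sequences: that difference is what the encoding scheme contributes, and the charset name is the label that tells a receiver which one was used.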
>
> 3) Probably my most important dilemma is - Can we use Unicode to represent
> data internally. Namely is there mapping tables from all the most widely
> used charsets in Europe and East Asia into Unicode and back???

Yes.

> If there are
> widely used encodings that don't map into Unicode nicely, what are they?

Er, GB18030 (?). It's so new that I haven't seen a mapping scheme yet. Of
course, it's not widely used yet either, though it is a requirement.

>
> 4) What is the set of IANA charsets for CJKV that I need to be able to
> handle in my product to lets say support 80-90% of content available in
> Asia?

Not in any order:

Shift_JIS
EUC-JP
GB2312 (sometimes called EUC-CN)
GBK
EUC-KR
CNS11643 (sometimes called EUC-TW)
Big5

Most of the above are listed by their preferred MIME names - see the IANA
registry at:
http://www.iana.org/assignments/character-sets

>
> Thanx in advance:-)
> Misha Gorelik
>

*;O)

nezashto,
Andrea
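On question 3, one quick way to convince yourself that Unicode works as an internal form is to push text through a legacy charset and back. This is only a sketch, assuming Python 3 and its bundled codecs; the codec spellings, the sample strings, and the helper name round_trips are illustrative choices, not anything from the IANA registry. EUC-TW is left out because, as far as I know, the Python standard library does not ship a codec for it.

```python
# Sample text per codec (hypothetical test data, chosen to fit each repertoire).
SAMPLES = {
    "shift_jis": "\u65e5\u672c\u8a9e",   # Japanese: "nihongo"
    "euc_jp": "\u65e5\u672c\u8a9e",
    "iso2022_jp": "\u65e5\u672c\u8a9e",
    "gb2312": "\u4e2d\u6587",            # Chinese: "zhongwen"
    "gbk": "\u4e2d\u6587",
    "gb18030": "\u4e2d\u6587",
    "big5": "\u4e2d\u6587",
    "euc_kr": "\ud55c\uad6d\uc5b4",      # Korean: "hangugeo"
}


def round_trips(text, codec):
    """Return True if text survives Unicode -> legacy bytes -> Unicode,
    and the legacy bytes survive a second pass unchanged."""
    raw = text.encode(codec)
    back = raw.decode(codec)
    return back == text and back.encode(codec) == raw


results = {codec: round_trips(text, codec) for codec, text in SAMPLES.items()}

for codec, ok in results.items():
    print(f"{codec:12} {'ok' if ok else 'MISMATCH'}")
```

A handful of common characters passing says nothing about vendor extensions or user-defined ranges, so a real evaluation would run the same check over each charset's full mapping table.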
Received on Tuesday, 14 August 2001 17:55:56 UTC