- From: Carl W. Brown <cbrown@xnetinc.com>
- Date: Tue, 14 Aug 2001 14:27:31 -0700
- To: <www-international@w3.org>
Misha,

> -----Original Message-----
> From: www-international-request@w3.org
> [mailto:www-international-request@w3.org]On Behalf Of Michael Gorelik
> Sent: Tuesday, August 14, 2001 1:35 PM
> To: www-international@w3.org
> Subject: Unicode <-> CJKV national encoding; supporting multi-lingual
> web content
>
> We work on a product that enables wireless devices to access any web
> content on the fly. I am in the process of evaluating options for
> multi-language support, and honestly I am ready to jump out of the
> window :-)
>
> I have several questions; hopefully someone can help me sort them
> out :-)
>
> 1) Has anyone seen information on the number of pages, or percentage
> of content, available in different charsets, such as ISO8859-1,
> UTF-8, UTF-16, EUC-JP, ISO-2022-JP, Shift-JIS, etc. (other than the
> Babel study)? I am trying to get an idea of the number of users of
> each particular charset.
>
> 2) Also, if someone can point out a nice table that lists language,
> character repertoire, coded character set, and charset, I would be
> very grateful. Something like this:
>
>   Language   Character Repertoire   Coded Character Set   charset
>   --------   --------------------   -------------------   -----------
>   English    ISO8859-1              ISO8859-1             ISO8859-1
>   Japanese   JIS X 0208-1990        Shift JIS             shift-jis
>                                     ISO-2022-JP           iso-2022-jp
>   etc.
>
> Of course I am still at a loss as to which standard defines the
> character repertoire, which defines the coded character set, which
> one defines the encoding, and which one defines the charset.

It is a bit out of date but still useful:
http://www.w3.org/International/O-charset-list.html

Nadine Kano's book is also useful. Also look at the ICU putil.c code.

I can send you some tables, but a lot of this is platform dependent.
If you are targeting browsers, the first table is the best. Note that
utf-8 will not always work with browsers that are not Unicode based.
That is why Netscape was rewritten around Unicode; users on Netscape
6.0 will get much better utf-8 support.

> 3) Probably my most important dilemma: can we use Unicode to
> represent data internally? Namely, are there mapping tables from all
> the most widely used charsets in Europe and East Asia into Unicode
> and back? If there are widely used encodings that do not map into
> Unicode nicely, what are they?

The only way to sanely implement a multi-lingual site is with Unicode.
The best support for Unicode is ICU:
http://oss.software.ibm.com/icu/
(I have tacked a few short sketches of the relevant ICU calls onto the
end of this message.)

If you are developing web server software, I have some open source
code that adds functionality on top of ICU. It has Apache mime
extensions, accept-language parsers, etc. It also allows you to use
the same code to process UTF-16, UTF-32, UTF-8 and code page data. It
will dynamically shift to support data in these different formats.
This is nice if you have a mix of web pages with different encodings
and have to serve browsers with different character set requirements
while still using a Unicode database.

For example, if you call xiua_strcoll it will compare two strings
using the current locale, which you set per thread, on data in
whatever format you are currently set to use. It uses ICU's collation
logic. Functions like xiua_strtok use separate implementations for
each data format; unlike the normal strtok, though, it is thread safe.

> 4) What is the set of IANA charsets for CJKV that I need to handle in
> my product to support, let's say, 80-90% of the content available in
> Asia?

Look at ICU's icu/data/convrtrs.txt; it lists character set alias
names and indicates which are the MIME/IANA names.

Carl
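P.S. Here are the promised sketches. This first one is a minimal
example, not production code: assuming ICU's C conversion API
(unicode/ucnv.h), it maps a Shift-JIS byte string into UTF-16 through
ICU's conversion tables, which is the round trip your question 3 is
about. The sample bytes spell "tesuto" in Shift-JIS; error handling is
abbreviated.

    #include <stdio.h>
    #include <unicode/ucnv.h>

    int main(void)
    {
        UErrorCode status = U_ZERO_ERROR;
        /* "Shift-JIS" is resolved through ICU's converter alias table */
        UConverter *cnv = ucnv_open("Shift-JIS", &status);
        char sjis[] = "\x83\x65\x83\x58\x83\x67";  /* "tesuto" */
        UChar utf16[64];
        int32_t len;

        if (U_FAILURE(status))
            return 1;

        /* map the code page bytes into UTF-16 */
        len = ucnv_toUChars(cnv, utf16, 64, sjis, sizeof(sjis) - 1,
                            &status);
        if (U_SUCCESS(status))
            printf("converted %d UChars\n", (int)len);

        ucnv_close(cnv);
        return 0;
    }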
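Collation works the same way. xiua_strcoll does more than this
(per-thread locales, mixed data formats), but underneath it comes down
to an ICU call along these lines; the ja_JP locale and the test
strings are just placeholders:

    #include <stdio.h>
    #include <unicode/ucol.h>
    #include <unicode/ustring.h>

    int main(void)
    {
        UErrorCode status = U_ZERO_ERROR;
        UCollator *coll = ucol_open("ja_JP", &status); /* Japanese rules */
        UChar a[16], b[16];
        UCollationResult r;

        if (U_FAILURE(status))
            return 1;

        u_uastrcpy(a, "abc");  /* invariant-character test data */
        u_uastrcpy(b, "ABC");

        /* compare according to the collator's locale rules */
        r = ucol_strcoll(coll, a, -1, b, -1);
        printf("%s\n", r == UCOL_LESS    ? "a < b" :
                       r == UCOL_GREATER ? "a > b" : "a == b");

        ucol_close(coll);
        return 0;
    }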
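Finally, for question 4 you do not have to read convrtrs.txt by hand.
Something like the following (assuming a current ICU build, since
ucnv_getStandardName is a fairly recent addition) walks the same alias
data and prints each converter's registered IANA and MIME names:

    #include <stdio.h>
    #include <unicode/ucnv.h>

    int main(void)
    {
        int32_t i, n = ucnv_countAvailable();

        for (i = 0; i < n; i++) {
            UErrorCode status = U_ZERO_ERROR;
            const char *name, *iana, *mime;

            name = ucnv_getAvailableName(i);
            /* ucnv_getStandardName() returns NULL if the converter has
             * no name registered under the given standard */
            iana = ucnv_getStandardName(name, "IANA", &status);
            status = U_ZERO_ERROR;  /* reset between lookups */
            mime = ucnv_getStandardName(name, "MIME", &status);

            printf("%-24s IANA=%-20s MIME=%s\n", name,
                   iana ? iana : "-", mime ? mime : "-");
        }
        return 0;
    }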
Received on Tuesday, 14 August 2001 17:27:40 UTC