RE: Unicode <-> CJKV national encoding; supporting multi-lingual web content from Carl W. Brown on 2001-08-14 (www-international@w3.org from July to September 2001)

From: Carl W. Brown <cbrown@xnetinc.com>
Date: Tue, 14 Aug 2001 14:27:31 -0700
To: <www-international@w3.org>
Message-ID: <FNEHIHOMIIDPDCIFEJEGKEFBCIAA.cbrown@xnetinc.com>

Misha,

> -----Original Message-----
> From: www-international-request@w3.org
> [mailto:www-international-request@w3.org]On Behalf Of Michael Gorelik
> Sent: Tuesday, August 14, 2001 1:35 PM
> To: www-international@w3.org
> Subject: Unicode <-> CJKV national encoding; supporting multi-lingual
> web content
>
>
> We work on a product that enables wireless devices to access any web
> content on the fly. I am in the process of evaluating options to enable
> multi-language support, and honestly I am ready to jump out of
> the window :-)
>
> I have several questions; hopefully someone can help me sort
> them out :-)
>
> 1) Has anyone seen information on the number of pages, or the % of
> content, available in different charsets, such as ISO8859-1, UTF-8,
> UTF-16, EUC-JP, ISO-2022-JP, Shift_JIS, etc. (other than the Babel
> study)? I am trying to get an idea of the number of users of each
> particular charset.
>
>
> 2) Also, if someone can point out a nice table that lists languages,
> character repertoire, coded character set, and charset, I would be very
> grateful. Something like this:
> Language   Character Repertoire   Coded Character Set   charset
> English    ISO8859-1              ISO8859-1             ISO8859-1
> Japanese   JIS X 0208-1990        shift jis             shift-jis
>                                   iso-2022-jp           iso-2022-jp
> etc.
> Of course I am still at a loss as to which standard defines the
> character repertoire, which defines the coded character set, which
> defines the encoding, and which defines the charset.

This list is a bit out of date but still useful:
http://www.w3.org/International/O-charset-list.html

Nadine Kano's book is also useful.

Also look at the ICU putil.c code.

I can send you some tables, but a lot of this is platform dependent.  If you
are targeting browsers, the first table is the best.  Note that UTF-8
support will not always work with browsers that are not Unicode based.  That
is why Netscape was rewritten for Unicode, and users of Netscape 6.0
will get much better UTF-8 support.

>
> 3) Probably my most important dilemma is: can we use Unicode to represent
> data internally? Namely, are there mapping tables from all the most widely
> used charsets in Europe and East Asia into Unicode and back? If there are
> widely used encodings that don't map into Unicode nicely, what are they?
>

The only sane way to implement a multi-lingual site is to use Unicode.  The
best support for Unicode is ICU.  http://oss.software.ibm.com/icu/  If you
are developing web server software, I have some open source code that adds
functionality on top of it.  It has Apache MIME extensions, Accept-Language
parsers, etc.  It also allows you to use the same code to process UTF-16,
UTF-32, UTF-8 and code page data.  It will dynamically shift to support data
in these different formats.  This is nice if you have a mix of web pages with
different encodings and have to serve browsers with different character set
requirements and still use a Unicode database.  For example, if you call
xiua_strcoll, it compares the two strings using the locale you set for the
current thread, with the data in whatever format you are currently set to
use.  It uses ICU's collation logic.  Functions like xiua_strtok use a
separate implementation for each type of data format.  Unlike the normal
strtok, however, it is thread safe.
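
To make the round trip through Unicode concrete, here is a minimal sketch in
Python.  It uses Python's built-in codecs as a stand-in for ICU's converters,
which is an assumption made purely for illustration; the function name
transcode and the sample string are hypothetical and are not part of ICU or
of the xiua code.  It decodes legacy-encoded bytes into Unicode and
re-encodes them for a client that wants a different charset:

```python
# Sketch only: Python's codecs stand in for ICU converters (an assumption).

def transcode(data: bytes, source_charset: str, target_charset: str) -> bytes:
    """Decode bytes from source_charset to Unicode, then encode to target_charset."""
    text = data.decode(source_charset)  # legacy bytes -> Unicode string
    # Characters the target charset cannot represent are emitted as
    # numeric character references instead of raising an error.
    return text.encode(target_charset, errors="xmlcharrefreplace")


sample = "日本語"  # hypothetical sample; all three characters are in JIS X 0208

# Each of these charsets should round-trip the sample through Unicode unchanged.
for charset in ("shift_jis", "euc_jp", "iso2022_jp", "utf-8", "utf-16"):
    assert sample.encode(charset).decode(charset) == sample

# Page stored as Shift_JIS, client asks for EUC-JP.
euc_bytes = transcode(sample.encode("shift_jis"), "shift_jis", "euc_jp")
assert euc_bytes.decode("euc_jp") == sample
```

Whether a given legacy charset round-trips cleanly depends on the mapping
table used, so a check like the loop above is worth running against the
converter you actually deploy.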

> 4) What is the set of IANA charsets for CJKV that I need to be able to
> handle in my product to, let's say, support 80-90% of the content
> available in Asia?

Look at ICU's icu/data/convrtrs.txt.  It lists character set alias names
and indicates which are the MIME/IANA names.
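
The idea of resolving many alias names to one canonical name can be sketched
in Python, whose codec registry keeps its own alias table.  This is not ICU's
alias file, and the canonical names returned are Python's own, not
necessarily the MIME/IANA preferred names.  The helper canonical_charset is a
hypothetical name, and the labels below are examples only, not a measured
list of what covers Asian content:

```python
import codecs
from typing import Optional


def canonical_charset(label: str) -> Optional[str]:
    """Return Python's canonical codec name for a charset label, or None if unknown."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None


# Illustrative labels as they might appear in a Content-Type header.
for label in ("Shift_JIS", "sjis", "EUC-JP", "ISO-2022-JP", "no-such-charset"):
    print(label, "->", canonical_charset(label))
```

A label that resolves to None is one the converter library cannot handle, so
it needs a fallback policy before the page is processed.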

Carl
Received on Tuesday, 14 August 2001 17:27:40 UTC