RE: International business communications and Unicode

From: Carl W. Brown <cbrown@xnetinc.com>
Date: Thu, 23 Aug 2001 11:31:24 -0700
To: <www-international@w3.org>
Message-ID: <FNEHIHOMIIDPDCIFEJEGCEIMCIAA.cbrown@xnetinc.com>
Eric,

> >Every page will still be translated into different language but
> > you only have one encoding.
> >
>
> this is the Unicode dream and maybe one day we'll see something like
> it...the actual process at present goes like this...you upload your
> lovely Unicode Japanese site only to find that most Japanese users
> can't access it, they can only access shift-xjis encoded sites...you
> then move on to Russian to discover that most Russian users are
> expecting the language to be encoded with Windows 1251...and don't
> get me on to Chinese
>
> Unicode is utterly wonderful...I love the idea to death...the ethos
> is truly inspiring...the practicality is that Russia, Japan and Hong
> Kong got online before Unicode began...the people of those nations
> will take some shifting from their current methods of representing
> their languages

I have put together a solution.  Yes, there are a lot of browsers out there
that do not support Unicode.  Take the case where your Japanese web pages are
encoded in EUC-JP, your database uses UTF-8 for Unicode and the browser is
using Shift_JIS.  You set up one locale for your pages ("ja_JP.EUC-JP"),
another for your database and another for the browser ("ja_JP.Shift_JIS").
The locales are thread independent.  The program adapts so that a call to
xiua_strcoll compares two strings using the Japanese collating order whether
the data is UTF-32, UTF-16, UTF-8 or code page data.  It can switch
dynamically between data formats and produce the same result.  A call to
xiua_strcmp likewise produces the same result for the different Unicode
encodings, so the code adapts to platforms that use different Unicode
encodings.
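
To illustrate the xiua_strcmp point, here is a minimal sketch in Python
rather than xIUA itself.  The helper name compare_encoded is hypothetical,
and the sketch assumes that "the same result" means ordering by Unicode code
point, which a raw byte comparison of UTF-16 data does not give you for
characters outside the BMP.

```python
def compare_encoded(a, a_encoding, b, b_encoding):
    """Compare two encoded strings by Unicode code point.

    Returns -1, 0 or 1, in the spirit of strcmp.
    Hypothetical helper for illustration; not part of xIUA or ICU.
    """
    a_text = a.decode(a_encoding)
    b_text = b.decode(b_encoding)
    # Python compares str values code point by code point.
    return (a_text > b_text) - (a_text < b_text)


low = "\ue000"       # a BMP character
high = "\U00010000"  # a supplementary character (surrogate pair in UTF-16)

# Raw UTF-16 bytes would sort the supplementary character first ...
raw_utf16_order = low.encode("utf-16-be") < high.encode("utf-16-be")

# ... while the code point comparison agrees across all three encodings.
results = [
    compare_encoded(low.encode(enc), enc, high.encode(enc), enc)
    for enc in ("utf-8", "utf-16-be", "utf-32-be")
]
```

The two arguments do not have to share an encoding; each one is decoded with
its own encoding before the comparison.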

It will also transform the data from one encoding to another.  If your data
is in a UTF-8 locale and you convert it to a Shift_JIS locale, it will take
care of converting the data.  If the browser is using EUC-JP encoding and the
HTML is in EUC-JP, it will see that the two locales use the same encoding and
will just copy the data.
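
The idea can be sketched as follows.  This is Python using its standard
codecs, not the xIUA API; the function names encoding_of and
convert_between_locales are made up for the example, and it assumes locale
names always carry the encoding after a dot, as in "ja_JP.EUC-JP".

```python
import codecs


def encoding_of(locale_name):
    """Return the encoding part of a name such as 'ja_JP.Shift_JIS'."""
    _, _, encoding = locale_name.partition(".")
    return encoding


def convert_between_locales(data, source_locale, target_locale):
    """Convert bytes from the source locale's encoding to the target's."""
    # codecs.lookup() normalizes aliases, so "EUC-JP" and "eucjp" match.
    source = codecs.lookup(encoding_of(source_locale)).name
    target = codecs.lookup(encoding_of(target_locale)).name
    if source == target:
        # Same encoding on both sides: hand the data back untouched.
        return data
    # Otherwise go through Unicode as the common base.
    return data.decode(source).encode(target)


page = "日本語".encode("euc_jp")
for_browser = convert_between_locales(page, "ja_JP.EUC-JP", "ja_JP.Shift_JIS")
unchanged = convert_between_locales(page, "ja_JP.EUC-JP", "ja_JP.eucjp")
```

Comparing the normalized codec names rather than the raw strings is a design
choice in this sketch, so that different spellings of one encoding still take
the copy-only path.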

The whole thing, however, works on a Unicode base.  It uses ICU
(http://oss.software.ibm.com/icu/), which is probably the most comprehensive
Unicode support package for C/C++ applications.  Some functions, like
xiua_strtok, must have different implementations for different forms of data,
but this is transparent to the user.  My code, xIUA
(http://www.xnetinc.com/xiua/), provides these alternate implementations as
needed.
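
One likely reason a tokenizer needs encoding-specific handling, offered as an
illustration and not as a description of xIUA's internals: in Shift_JIS the
second byte of a two-byte character can have the same value as an ASCII
delimiter.  The Python sketch below (split_tokens is a hypothetical name)
shows a byte-oriented split going wrong and an encoding-aware one getting it
right.

```python
def split_tokens(data, encoding, delimiter):
    """Split encoded bytes on a delimiter character, strtok style.

    The data is decoded first, so a delimiter byte that is really the
    trail byte of a multibyte character is never mistaken for a delimiter.
    Empty tokens are dropped, as strtok does.
    """
    text = data.decode(encoding)
    return [tok.encode(encoding) for tok in text.split(delimiter) if tok]


data = "ポスト|カード".encode("shift_jis")

# Byte-oriented split: the trail byte of "ポ" has the same value as "|",
# so the first character is cut in half and an extra token appears.
naive = data.split(b"|")

# Encoding-aware split: two intact tokens.
tokens = [tok.decode("shift_jis") for tok in split_tokens(data, "shift_jis", "|")]
```

The same split_tokens call works unchanged on UTF-8 or UTF-16 data, which is
the sense in which the difference stays hidden from the caller.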

Both ICU and xIUA are free open source code, so they can be tailored to your
specific needs.  In fact, xIUA is a starter package that is designed to be
part of your application, so you can adapt it as you need.  It also contains
code that you can use with an Apache web server to organize your web pages
into language-specific directories, so that it is easier to organize the site
and make links.  This also reduces mishaps because each directory uses the
same code page.  Better yet, you can convert all your files to UTF-8 and then
just translate to the code page that the browser needs.  As more browsers
start supporting UTF-8, the translation will become unnecessary.
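
A rough sketch of the "store UTF-8, translate for the browser" approach, in
Python rather than the Apache code shipped with xIUA.  The function
serve_page is hypothetical, and the sketch assumes the browser announces what
it can read in an HTTP Accept-Charset header; it ignores q-values and simply
takes the first charset it recognizes.

```python
import codecs


def serve_page(utf8_bytes, accept_charset):
    """Return (charset, body) for a page stored as UTF-8."""
    accepted = [
        item.split(";")[0].strip().lower()
        for item in accept_charset.split(",")
        if item.strip()
    ]
    if not accepted or "utf-8" in accepted or "*" in accepted:
        # The browser can take UTF-8 (or did not say): no translation.
        return "utf-8", utf8_bytes
    text = utf8_bytes.decode("utf-8")
    for name in accepted:
        try:
            codecs.lookup(name)
        except LookupError:
            continue  # a charset this sketch does not know; try the next
        # Characters missing from the target code page become numeric
        # character references, which HTML can still display.
        return name, text.encode(name, errors="xmlcharrefreplace")
    return "utf-8", utf8_bytes


charset, body = serve_page("日本語".encode("utf-8"), "Shift_JIS")
```

Falling back to numeric character references is one possible policy for
characters the target code page lacks; a site could equally refuse the
conversion or pick a different charset.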

>
> and if I wish to mix languages on a single page...if I wish to use a
> German or French quote in a passage of English text?...I like the
> broad idea...however over-automisation of language seems to be
> disastrous...people are strange about language...you can see it on
> our site where users seem to leap between languages at particular
> points...a lot of people seem to have different preferred languages
> for collecting information and for dealing with personal matters

I browse pages with mixed scripts all the time.  Most people want to see only
one language, but the site must support multiple languages.

Carl
Received on Thursday, 23 August 2001 14:31:25 GMT