RE: Unicode <-> CJKV national encoding; supporting multi-lingual webcontent from Carl W. Brown on 2001-08-15 (www-international@w3.org from July to September 2001)

From: Carl W. Brown <cbrown@xnetinc.com>
Date: Wed, 15 Aug 2001 14:10:24 -0700
To: <www-international@w3.org>
Message-ID: <FNEHIHOMIIDPDCIFEJEGOEFLCIAA.cbrown@xnetinc.com>
Andrea,

> -----Original Message-----
> From: www-international-request@w3.org
> [mailto:www-international-request@w3.org]On Behalf Of A. Vine
> Sent: Wednesday, August 15, 2001 10:40 AM
> To: www-international@w3.org
> Subject: Re: Unicode <-> CJKV national encoding; supporting
> multi-lingual webcontent
>
>
>
> "Carl W. Brown" wrote:
> >
> > The only way to sanely implement a multi-lingual site is using
> Unicode.  The
> > best support for Unicode is ICU.
> http://oss.software.ibm.com/icu/  If you
>
> or Java :-)

Java has its own set of problems (challenges).  First, many people already
have code written in C that they do not want to rewrite.  It is also not as
easy to get at the underlying Unicode data in Java, and C code usually runs
faster.  Then there is the problem of JVM versions and conflicting support.

ICU started with the Java Unicode support and adapted it for C/C++
applications.  http://oss.software.ibm.com/icu/ You will notice that many of
the functions are very Javaesque.

It is a great component library for Unicode.  What I have added is extra
functions that don't really belong in ICU proper.  xIUA, unlike ICU, is
designed as a sample starting point for code that you develop as part of
your application.  http://www.xnetinc.com/xiua/  While it is designed for
typical applications, it is especially useful for web server applications.
For example, it adds support for a per-thread set of locales.  You can have
one locale for the browser, one for your HTML pages and one for your Unicode
database.

You can make calls to transform your data from your HTML charset which may
be EUC-JP to your browser charset which is Shift_JIS.  The same code may
convert the same page to UTF-8 for the next browser.   If you are parsing
data your code can call xiua_strtok and the same call will work for UTF-32,
UTF-16, UTF-8 and code page data.  Unlike the normal strtok it is also
thread safe.
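
To illustrate why a re-entrant tokenizer matters here, the following is a minimal sketch in standard C.  It is not the xiua_strtok source; the name tok_r and its signature are made up for this example.  The scan position lives in a caller-owned pointer instead of hidden static storage, so two threads can tokenize at the same time.  It handles single-byte ASCII delimiters only.  That is safe for UTF-8, where bytes below 0x80 never occur inside a multi-byte character, but not for Shift_JIS, where a trail byte can fall in the ASCII range.

```c
#include <string.h>

/* Hypothetical re-entrant tokenizer (illustration only, not xIUA code).
 * Pass the buffer on the first call and NULL afterwards; *state carries
 * the scan position between calls, so no static storage is involved. */
char *tok_r(char *str, const char *delims, char **state)
{
    char *end;

    if (str == NULL)
        str = *state;               /* continue where the last call stopped */
    if (str == NULL)
        return NULL;                /* input already exhausted */

    str += strspn(str, delims);     /* skip leading delimiters */
    if (*str == '\0') {
        *state = NULL;
        return NULL;
    }

    end = str + strcspn(str, delims);   /* find the end of this token */
    if (*end != '\0') {
        *end = '\0';                /* terminate the token in place */
        *state = end + 1;
    } else {
        *state = NULL;              /* that was the last token */
    }
    return str;
}
```

A version that covers UTF-16, UTF-32 and code page data as described above would have to step through the buffer by character rather than by byte.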

It manages your locale information including time zones using Java style
time zones.  It also has special web functions.  For example it will analyze
a browser accept language string including the q= quality selections and
return the first choice language based on the installed ICU locales.  It
will also analyze a path and return any RFC 3066 language subdirectory name
that is found to match your ICU installed locales.
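
The Accept-Language selection can be sketched roughly as follows.  This is an illustration only, not the xIUA routine: pick_language is a hypothetical name, the installed locales are passed in as a plain array rather than queried from ICU, and the names are assumed to use hyphens as in RFC 3066 (ICU itself uses underscores).  A missing q= is treated as 1.0, an exact match is tried before falling back to the primary subtag, and the "*" wildcard is not handled.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Case-insensitive comparison of the first n bytes of a with all of b. */
static int tag_equals(const char *a, size_t n, const char *b)
{
    size_t i;
    if (strlen(b) != n)
        return 0;
    for (i = 0; i < n; i++)
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 0;
    return 1;
}

static const char *find_locale(const char *tag, size_t len,
                               const char *const *installed, size_t count)
{
    size_t i;
    for (i = 0; i < count; i++)
        if (tag_equals(tag, len, installed[i]))
            return installed[i];
    return NULL;
}

/* Hypothetical helper: returns the installed locale with the highest
 * q value in the header, or NULL if none of the ranges match. */
const char *pick_language(const char *header,
                          const char *const *installed, size_t count)
{
    const char *best = NULL;
    double best_q = 0.0;            /* q=0 means "not acceptable" */
    const char *p = header;

    while (*p != '\0') {
        const char *tag, *end, *semi, *match;
        size_t tag_len, primary_len;
        double q = 1.0;             /* default when no q= is given */

        while (*p == ' ' || *p == '\t' || *p == ',')
            p++;
        if (*p == '\0')
            break;
        tag = p;
        end = strchr(p, ',');
        if (end == NULL)
            end = p + strlen(p);

        semi = memchr(tag, ';', (size_t)(end - tag));
        tag_len = (size_t)((semi != NULL ? semi : end) - tag);
        while (tag_len > 0 && isspace((unsigned char)tag[tag_len - 1]))
            tag_len--;

        if (semi != NULL) {
            const char *s = semi + 1;
            while (s < end && isspace((unsigned char)*s))
                s++;
            if (end - s >= 2 && (s[0] == 'q' || s[0] == 'Q') && s[1] == '=')
                q = strtod(s + 2, NULL);
        }

        primary_len = 0;            /* length of "en" in "en-gb" */
        while (primary_len < tag_len && tag[primary_len] != '-')
            primary_len++;

        if (q > best_q) {           /* earlier entries win ties */
            match = find_locale(tag, tag_len, installed, count);
            if (match == NULL)
                match = find_locale(tag, primary_len, installed, count);
            if (match != NULL) {
                best = match;
                best_q = q;
            }
        }
        p = end;
    }
    return best;
}
```

With installed locales "en", "ja" and "fr", the header "da, en-gb;q=0.8, ja;q=0.9" selects "ja": "da" is not installed, and "ja" outranks the "en" fallback for "en-gb".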

It also has special migration aids.  For example, it has a routine that will
convert a strftime date/time format to an ICU format using the ICU values
from its resource bundles.
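
The general idea of such a conversion can be sketched as below.  This is a simplified stand-in, not the xIUA routine: strftime_to_icu is a hypothetical name, and it uses a small fixed table instead of ICU resource bundle values, so locale-dependent specifiers such as %c, %x and %X are rejected rather than expanded.  Literal letters are wrapped in single quotes because ICU date patterns treat unquoted ASCII letters as pattern fields.

```c
#include <ctype.h>
#include <string.h>

/* Fixed mapping from a strftime specifier to an ICU pattern field.
 * Returns NULL for specifiers this sketch does not cover. */
static const char *icu_field(char spec)
{
    switch (spec) {
    case 'Y': return "yyyy";
    case 'y': return "yy";
    case 'm': return "MM";
    case 'd': return "dd";
    case 'H': return "HH";
    case 'I': return "hh";
    case 'M': return "mm";
    case 'S': return "ss";
    case 'p': return "a";
    case 'b': return "MMM";
    case 'B': return "MMMM";
    case 'a': return "EEE";
    case 'A': return "EEEE";
    case 'j': return "DDD";
    case 'Z': return "z";
    default:  return NULL;
    }
}

/* Appends n bytes of s at *pos, keeping out NUL-terminated. */
static int append_bytes(char *out, size_t size, size_t *pos,
                        const char *s, size_t n)
{
    if (*pos + n + 1 > size)
        return -1;
    memcpy(out + *pos, s, n);
    *pos += n;
    out[*pos] = '\0';
    return 0;
}

/* Hypothetical converter: returns 0 on success, -1 on an unsupported
 * specifier, a dangling '%', or an output buffer that is too small. */
int strftime_to_icu(const char *fmt, char *out, size_t size)
{
    size_t pos = 0;
    int quoted = 0;                 /* 1 while inside an ICU '...' literal */

    if (size == 0)
        return -1;
    out[0] = '\0';

    while (*fmt != '\0') {
        char c = *fmt++;

        if (c == '%') {
            if (*fmt == '\0')
                return -1;
            if (*fmt != '%') {
                const char *field = icu_field(*fmt++);
                if (field == NULL)
                    return -1;
                if (quoted) {       /* close any open literal first */
                    if (append_bytes(out, size, &pos, "'", 1) != 0)
                        return -1;
                    quoted = 0;
                }
                if (append_bytes(out, size, &pos, field, strlen(field)) != 0)
                    return -1;
                continue;
            }
            fmt++;                  /* "%%" is a literal percent sign */
        }
        if (c == '\'') {            /* '' is a literal quote in ICU */
            if (append_bytes(out, size, &pos, "''", 2) != 0)
                return -1;
            continue;
        }
        if ((isalpha((unsigned char)c) != 0) != quoted) {
            /* open a literal before letters, close it after them */
            if (append_bytes(out, size, &pos, "'", 1) != 0)
                return -1;
            quoted = !quoted;
        }
        if (append_bytes(out, size, &pos, &c, 1) != 0)
            return -1;
    }
    if (quoted && append_bytes(out, size, &pos, "'", 1) != 0)
        return -1;
    return 0;
}
```

For example, "%d %B %Y at %I:%M %p" becomes "dd MMMM yyyy 'at' hh:mm a".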

It also makes conversion easier because, as in Java, you don't have to pass
the locale to every function that may invoke ICU.  That means you don't have
to change any existing APIs to convert to Unicode.

This code is really a starting point for users.  It is designed to be
customized by users.  It also has alternative functions.  For example I have
xiua_strcoll that most C programmers can relate to but since the first
implementation was in a special version of PHP, I also have:

int32_t          /* 1 = TRUE, 0 = FALSE, -1 = LOGIC ERROR */
 xiua_Collate(char *str1, /* string 1 */
  char * option, /* option string contains both comparison test */
                 /* and optional collation strength parameters */
                 /* "==" "<=" ">=" "!=" "<" ">" are the */
                 /* comparison test values and "?" ":" "#" are */
                 /* the valid strength codes.  "==?" is a test */
                 /* for equal primary strength. */
                 /* ? = Primary letters match no case or case */
                 /* e.g "Black-bird" ==? "blackbird"  */
                 /* but what constitutes separate letters may differ */
                 /* by locale e.g. Spanish ch ll */
                 /* Secondary case insensitive normalized with accents */
                 /* : = Tertiary above plus case sensitive */
                 /* # = Strict match */
                 /* spaces are ignored, non-standard conditions are */
                 /* supported "!<>" or "=" are the same as "==" */
                 /* "" or "!" however are illogical and are errors */
  char * str2);  /* string 2 */

Because the result is a TRUE/FALSE it is easy to embed the result into a
more complex test or a regular expression.
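
One way to read the option string described in the comments above is as a set of relation characters plus an optional negation.  The sketch below is only my interpretation of those comments, not the xIUA implementation: eval_option is a hypothetical name, and the ordering value cmp is assumed to come from whatever collation call you use (negative, zero or positive, as with strcoll).  The strength code is only extracted and handed back to the caller, not applied.

```c
#include <stddef.h>

enum { REL_LT = 1, REL_EQ = 2, REL_GT = 4 };

/* Hypothetical evaluator: returns 1 (TRUE), 0 (FALSE) or -1 (logic error).
 * option   - e.g. "==?", "<=", "!<>"; spaces are ignored
 * cmp      - ordering of str1 relative to str2 (<0, 0, >0)
 * strength - if not NULL, receives '?', ':' or '#', or '\0' if none given */
int eval_option(const char *option, int cmp, char *strength)
{
    int mask = 0;
    int negate = 0;
    int actual;
    const char *p;

    if (strength != NULL)
        *strength = '\0';

    for (p = option; *p != '\0'; p++) {
        switch (*p) {
        case '<': mask |= REL_LT; break;
        case '=': mask |= REL_EQ; break;    /* "=" and "==" are the same */
        case '>': mask |= REL_GT; break;
        case '!': negate = 1; break;
        case '?': case ':': case '#':
            if (strength != NULL)
                *strength = *p;
            break;
        case ' ':
            break;
        default:
            return -1;                      /* unknown character */
        }
    }

    if (mask == 0)
        return -1;                          /* "" and "!" test nothing */
    if (negate)
        mask = ~mask & (REL_LT | REL_EQ | REL_GT);

    actual = cmp < 0 ? REL_LT : (cmp > 0 ? REL_GT : REL_EQ);
    return (mask & actual) != 0;
}
```

Under this reading "!<>" negates "less or greater" and so behaves like "==", which matches the comment in the prototype.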

It also has useful i18n functions such as xiua_strncpyEx.  It works somewhat
like strncpy except that it always adds a null to the end of the target
string and only copies full characters.  So if it is copying UTF-8 or
Shift_JIS data you will always get full characters copied, even if it means
that the target buffer is not quite full.  The terminating null is sized to
the encoding, so it is 4 bytes for UTF-32 data.  To make it easier to use, it
returns the length of the data copied.
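
For the UTF-8 case only, the behavior described above can be sketched like this.  It is not the xiua_strncpyEx source: utf8_copy is a hypothetical name, the size argument is assumed to be the full size of the target buffer in bytes, and the input is assumed to be valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical UTF-8-only bounded copy (illustration, not xIUA code).
 * Copies at most size - 1 bytes, never splits a multi-byte character,
 * always terminates dst, and returns the number of bytes copied
 * (not counting the terminating null). */
size_t utf8_copy(char *dst, const char *src, size_t size)
{
    size_t n;

    if (size == 0)
        return 0;

    n = strlen(src);
    if (n > size - 1) {
        n = size - 1;
        /* src[n] is the first byte that will NOT be copied.  If it is a
         * continuation byte (10xxxxxx) the cut falls inside a character,
         * so back up to that character's lead byte. */
        while (n > 0 && ((unsigned char)src[n] & 0xC0) == 0x80)
            n--;
    }
    memcpy(dst, src, n);
    dst[n] = '\0';
    return n;
}
```

Copying the four-byte string "a", U+00E9, "b" into a three-byte buffer yields just "a": the two-byte character does not fit alongside the terminator, so it is dropped whole and one byte of the buffer stays unused.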

>
> Also, if you're generating HTML forms, I recommend you take a look at:
>
> http://www.unicode.org/iuc/iuc17/papers.html

Good presentation.  One thing that my software does is help overcome some of
the internationalization problems that you have with the Apache web server
but don't have with servers like iPlanet, while still staying iPlanet
compatible.

I think that adding language as a type is an administrative nightmare.
www.mysite.com/dir/subdir/mypage.html.en is a bad idea.  If nothing else it
creates problems keeping track of links.  It is also a maintenance
nightmare.  If I also have www.mysite.com/dir/subdir/mypage.html.jp, which is
encoded in sjis, mixed in with pages in other encodings, it is easy to get
your hands crossed.


>
> Look at the presentation under Session A2 (mine ;-) and it looks like the
> presentation under A3 might have some useful information (David Taieb's).
>
> Regards,
> Andrea
>

Carl
Received on Wednesday, 15 August 2001 17:10:58 UTC