RE: Unicode <-> CJKV national encoding; supporting multi-lingualwebcontent from Carl W. Brown on 2001-08-18 (www-international@w3.org from July to September 2001)

From: Carl W. Brown <cbrown@xnetinc.com>
Date: Sat, 18 Aug 2001 09:43:26 -0700
To: <www-international@w3.org>
Message-ID: <FNEHIHOMIIDPDCIFEJEGEEGMCIAA.cbrown@xnetinc.com>
Yung,

> -----Original Message-----
> From: www-international-request@w3.org
> [mailto:www-international-request@w3.org]On Behalf Of Yung-Fong Tang
> Sent: Friday, August 17, 2001 3:57 PM
> To: A. Vine
> Cc: Carl W. Brown; www-international@w3.org
> Subject: Re: Unicode <-> CJKV national encoding; supporting
> multi-lingualwebcontent
>
>
>
> > > > The only way to sanely implement a multi-lingual site is using
> > > > Unicode.
> I think you can also use GB18030 these days, if you care about GB2312
> backward compatibility.

I don't expect GB18030 to pick up the range of non-Chinese scripts that
Unicode has.  Yes, the assignments are automatic, but in practice will they
be used?  GB18030 is more awkward to process than most MBCS encodings.  You
cannot always determine the character length from the first byte.  It is
also difficult to determine the start of a character when you access a
string randomly.
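
To make the length problem concrete, here is a minimal sketch in C.  It
assumes the byte ranges of the GB18030 structure (one byte: 0x00-0x7F; two
bytes: lead 0x81-0xFE, trail 0x40-0x7E or 0x80-0xFE; four bytes: 0x81-0xFE,
0x30-0x39, 0x81-0xFE, 0x30-0x39).  The name gb18030_char_len is hypothetical;
it is not an ICU or xIUA function.  Note that a lead byte alone does not say
whether the character is two or four bytes long; the second byte decides.

```c
#include <stddef.h>

/* Hypothetical helper, not part of ICU or xIUA.
 * Returns the length in bytes (1, 2 or 4) of the GB18030 character that
 * starts at s, or 0 if the bytes are malformed or the buffer is too short.
 * n is the number of bytes available at s. */
static size_t gb18030_char_len(const unsigned char *s, size_t n)
{
    if (n == 0)
        return 0;
    if (s[0] <= 0x7F)
        return 1;                       /* single byte (ASCII range) */
    if (s[0] == 0x80 || s[0] == 0xFF)
        return 0;                       /* not a valid lead byte */
    if (n < 2)
        return 0;                       /* lead byte with nothing after it */

    /* The lead byte is 0x81-0xFE.  Only the second byte tells us whether
     * this is a two-byte or a four-byte character. */
    if (s[1] >= 0x40 && s[1] <= 0xFE && s[1] != 0x7F)
        return 2;
    if (s[1] >= 0x30 && s[1] <= 0x39) {
        if (n >= 4 &&
            s[2] >= 0x81 && s[2] <= 0xFE &&
            s[3] >= 0x30 && s[3] <= 0x39)
            return 4;
    }
    return 0;
}
```

With most MBCS encodings the equivalent routine only needs to look at s[0].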

ICU http://oss.software.ibm.com/icu/ supports GB18030
http://www-106.ibm.com/developerworks/library/u-china.html?dwzone=unicode
but I have not yet added support in xIUA.
http://www.xnetinc.com/xiua/  Maybe you can help me.  Is there a good
algorithm for finding the start of a character without going to the
beginning of the string?
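
The best I can sketch myself is a partial answer, not a known-good
algorithm.  It assumes well-formed data and relies on one observation: bytes
0x00-0x2F, 0x3A-0x3F and 0x7F never appear inside a two-byte or four-byte
GB18030 character, so each one is always a complete character.  The code
backs up to the nearest such byte (or to offset 0) and scans forward from
there.  All names are hypothetical, and the length helper is included so the
sketch stands alone.

```c
#include <stddef.h>

/* Hypothetical helper: length (1, 2 or 4) of the GB18030 character at s,
 * or 0 if malformed or truncated.  n is the number of bytes available. */
static size_t gb18030_char_len(const unsigned char *s, size_t n)
{
    if (n == 0) return 0;
    if (s[0] <= 0x7F) return 1;
    if (s[0] == 0x80 || s[0] == 0xFF || n < 2) return 0;
    if (s[1] >= 0x40 && s[1] <= 0xFE && s[1] != 0x7F) return 2;
    if (s[1] >= 0x30 && s[1] <= 0x39 && n >= 4 &&
        s[2] >= 0x81 && s[2] <= 0xFE &&
        s[3] >= 0x30 && s[3] <= 0x39) return 4;
    return 0;
}

/* True for byte values that can only be a complete one-byte character:
 * they are outside every trail-byte and digit-byte range. */
static int gb18030_always_single(unsigned char b)
{
    return b <= 0x2F || (b >= 0x3A && b <= 0x3F) || b == 0x7F;
}

/* Hypothetical sketch: returns the offset of the first byte of the
 * character that contains offset pos, or len if pos is out of range. */
static size_t gb18030_char_start(const unsigned char *buf, size_t len,
                                 size_t pos)
{
    size_t sync = pos;
    size_t i, step;

    if (pos >= len)
        return len;

    /* Back up to a byte that must be a character by itself, or to the
     * start of the buffer, whichever comes first. */
    while (sync > 0 && !gb18030_always_single(buf[sync]))
        sync--;

    /* Scan forward from that boundary until we reach the character that
     * covers pos. */
    i = sync;
    while (i < len) {
        step = gb18030_char_len(buf + i, len - i);
        if (step == 0)
            step = 1;           /* malformed input: advance one byte */
        if (i + step > pos)
            return i;
        i += step;
    }
    return pos;                 /* not reached when pos < len */
}
```

In the worst case (a long run with no spaces, line ends or ASCII punctuation)
this still walks back to the beginning of the string, so it does not fully
answer the question.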

> If you try, you will also find out it is not too hard to port Java code
> back to C++ code as long as you don't use too many support classes ... .:)
>
Is there Java support for GB18030?

>
> "A. Vine" wrote
>
> > Carl,
> > Interesting response to a suggestion which is not unreasonable nor
> > far-fetched.
> > Comments imbedded:
> >
> > "Carl W. Brown" wrote:
> > >
> > > Andrea,
> > >
> > > > -----Original Message-----
> > > > From: www-international-request@w3.org
> > > > [mailto:www-international-request@w3.org]On Behalf Of A. Vine
> > > > Sent: Wednesday, August 15, 2001 10:40 AM
> > > > To: www-international@w3.org
> > > > Subject: Re: Unicode <-> CJKV national encoding; supporting
> > > > multi-lingual webcontent
> > > >
> > > >
> > > >
> > > > "Carl W. Brown" wrote:
> > > > >
> > > > > The only way to sanely implement a multi-lingual site is using
> > > > > Unicode.  The best support for Unicode is ICU.
> > > > > http://oss.software.ibm.com/icu/  If you
> > > >
> > > > or Java :-)
> > >
> > > Java has its own set of problems (challenges).  First, many people
> > > already have code written in C that they do not want to rewrite.
> >
> > No mention was made of whether the code was in C or anything else.  I
> > was simply suggesting Java.
> >
> > >  It is not as easy
> > > to actually get to the underlying Unicode in Java.
> >
> > Please give an example.
> >
> > >  C code usually runs
> > > faster.  Then there is the problem of JVM versions and conflicting
> > > support.
> >
> > Conflicting support?
> >
> > >
> > > ICU started with the Java Unicode support and adapted it for C/C++
> > > applications.  http://oss.software.ibm.com/icu/ You will notice that
> > > many of the functions are very Javaesque.
> >
> > Yup, we here use some of the older code written by Netscape and Taligent
> > before ICU was created as a C/C++ parallel to the Java functionality.  Of
> > course, there have been updates.
> >
> > >
> > > It is a great component library for Unicode.  What I have added are
> > > extra functions that don't really belong in ICU proper.  xIUA, unlike
> > > ICU, is designed as a sample starting point for code that you develop as
> > > part of your application.  http://www.xnetinc.com/xiua/  While it is
> > > designed for typical applications, it is especially useful for web
> > > server applications.  For example, it adds support for a per-thread set
> > > of locales.  You can have one locale for the browser, one for your HTML
> > > pages and one for your Unicode database.
> > >
> > > You can make calls to transform your data from your HTML charset, which
> > > may be EUC-JP, to your browser charset, which is Shift_JIS.  The same
> > > code may convert the same page to UTF-8 for the next browser.  If you
> > > are parsing data, your code can call xiua_strtok and the same call will
> > > work for UTF-32, UTF-16, UTF-8 and code page data.  Unlike the normal
> > > strtok it is also thread safe.
> > >
> > > It manages your locale information, including time zones, using Java
> > > style time zones.  It also has special web functions.  For example, it
> > > will analyze a browser Accept-Language string, including the q= quality
> > > selections, and return the first choice language based on the installed
> > > ICU locales.  It will also analyze a path and return any RFC 3066
> > > language subdirectory name that is found to match your ICU installed
> > > locales.
> > >
> > > It also has special migration aids.  It has a routine, for example,
> > > that will convert a strftime date time format to an ICU format using
> > > the ICU values from its resource bundles.
> > >
> > > It also makes conversion easier because, like Java, you don't have to
> > > pass the locale to every function that may invoke ICU, so you don't
> > > have to change any existing APIs to convert to Unicode.
> >
> > Great.  Keep up the good work.  For those writing in Java (apparently a
> > majority of coders worldwide, according to a recent study), you may have
> > to write this stuff on your own, or find a compatible Java library.  I
> > wouldn't say that you had to scrap all your Java code, though.
> >
> > >
> > > This code is really a starting point for users.  It is designed to be
> > > customized by users.  It also has alternative functions.  For example,
> > > I have xiua_strcoll that most C programmers can relate to, but since
> > > the first implementation was in a special version of PHP, I also have:
> > >
> > > int32_t          /* 1 = TRUE, 0 = FALSE, -1 = LOGIC ERROR */
> > >  xiua_Collate(char *str1, /* string 1 */
> > >   char * option, /* option string contains both comparison test */
> > >                  /* and optional collation strength parameters */
> > >                  /* "==" "<=" ">=" "!=" "<" ">" are the */
> > >                  /* comparison test values and "?" ":" "#" are */
> > >                  /* the valid strength codes.  "==?" is a test */
> > >                  /* for equal primary strength. */
> > >                  /* ? = Primary letters match no case or case */
> > >                  /* e.g "Black-bird" ==? "blackbird"  */
> > >                  /* but what constitutes separate letters may differ */
> > >                  /* by locale e.g. Spanish ch ll */
> > >                  /* Secondary case insensitive normalized with accents */
> > >                  /* : = Tertiary above plus case sensitive */
> > >                  /* # = Strict match */
> > >                  /* spaces are ignored, non-standard conditions are */
> > >                  /* supported "!<>" or "=" are the same as "==" */
> > >                  /* "" or "!" however are illogical and are errors */
> > >   char * str2);  /* string 2 */
> > >
> > > Because the result is a TRUE/FALSE it is easy to embed the result into
> > > a more complex test or a regular expression.
> > >
> > > It also has useful i18n functions such as xiua_strncpyEx.  It works
> > > somewhat like strncpy except that it always adds a null to the end of
> > > the target string and only copies full characters.  So if it is copying
> > > UTF-8 or Shift_JIS data you will always get full characters copied, even
> > > if it means that the target buffer is not quite full.  The terminating
> > > null is 4 bytes if it is UTF-32 data.  To make it easier to use, it
> > > returns the data length copied.
> > >
> > > >
> > > > Also, if you're generating HTML forms, I recommend you take a look
> > > > at:
> > > >
> > > > http://www.unicode.org/iuc/iuc17/papers.html
> > >
> > > Good presentation.
> >
> > Born of 2 painful projects, one in Java, one in C.  Somewhat out of date
> > but the same considerations still hold.
> >
> > > One thing that my software does is help overcome some of
> > > the problems with internationalization that you have with the Apache
> > > web server that you don't have with servers like iPlanet, while staying
> > > iPlanet compatible.
> > >
> > > I think that adding language as a type is an administrative nightmare.
> > > www.mysite.com/dir/subdir/mypage.html.en is a bad idea.  If nothing
> > > else it creates problems keeping track of links.  It is also a
> > > maintenance nightmare.  If I also have
> > > www.mysite.com/dir/subdir/mypage.html.jp, which is encoded in sjis,
> > > mixed in with pages in other encodings, it is easy to get your hands
> > > crossed.
> >
> > Hmm, I never remember saying that.  I assume you're just adding this as
> > advice to, er, was it Misha?
> >
> > Andrea
> >
> > >
> > > >
> > > > Look at the presentation under Session A2 (mine ;-) and it looks
> > > > like the presentation under A3 might have some useful information
> > > > (David Taieb's).
> > > >
> > > > Regards,
> > > > Andrea
> > > >
> > >
> > > Carl
>
Received on Saturday, 18 August 2001 12:43:32 UTC