Re: Unicode <-> CJKV national encoding; supporting multi-lingual webcontent from Yung-Fong Tang on 2001-08-17 (www-international@w3.org from July to September 2001)

From: Yung-Fong Tang <ftang@netscape.com>
Date: Fri, 17 Aug 2001 15:56:37 -0700
To: "A. Vine" <avine@eng.sun.com>
CC: "Carl W. Brown" <cbrown@xnetinc.com>, www-international@w3.org
Message-ID: <3B7DA124.6A335E8B@netscape.com>
> > > The only way to sanely implement a multi-lingual site is using
> > Unicode.
I think you can also use GB18030 these days, if you care about GB2312 backward
compatibility.
If you try, you will also find that it is not too hard to port Java code back to C++
code, as long as you don't use too many support classes... :)


"A. Vine" wrote:

> Carl,
> Interesting response to a suggestion which is not unreasonable nor far-fetched.
> Comments imbedded:
>
> "Carl W. Brown" wrote:
> >
> > Andrea,
> >
> > > -----Original Message-----
> > > From: www-international-request@w3.org
> > > [mailto:www-international-request@w3.org]On Behalf Of A. Vine
> > > Sent: Wednesday, August 15, 2001 10:40 AM
> > > To: www-international@w3.org
> > > Subject: Re: Unicode <-> CJKV national encoding; supporting
> > > multi-lingual webcontent
> > >
> > >
> > >
> > > "Carl W. Brown" wrote:
> > > >
> > > > The only way to sanely implement a multi-lingual site is using
> > > Unicode.  The
> > > > best support for Unicode is ICU.
> > > http://oss.software.ibm.com/icu/  If you
> > >
> > > or Java :-)
> >
> > Java has its own set of problems (challenges).  First, many people already
> > have code written in C that they do not want to rewrite.
>
> No mention was made of whether the code was in C or anything else.  I was simply
> suggesting Java.
>
> >  It is not as easy
> > to actually get to the underlying Unicode in Java.
>
> Please give an example.
>
> >  C code usually runs
> > faster.  Then there is the problem of JVM versions and conflicting support.
>
> Conflicting support?
>
> >
> > ICU started with the Java Unicode support and adapted it for C/C++
> > applications.  http://oss.software.ibm.com/icu/ You will notice that many of
> > the functions are very Javaesque.
>
> Yup, we here use some of the older code written by Netscape and Taligent before
> ICU was created as a C/C++ parallel to the Java functionality.  Of course, there
> have been updates.
>
> >
> > It is a great component library for Unicode.  What I have added is extra
> > functions that don't really belong in ICU proper.  xIUA unlike ICU is
> > designed as a sample starting point for code that you develop as part of
> > your application.  http://www.xnetinc.com/xiua/  While it is designed for
> > typical applications it is especially useful for web server applications.
> > For example, it adds support for a per-thread set of locales.  You can have
> > one locale for the browser, one for your HTML pages and one for your Unicode
> > database.
> >
> > You can make calls to transform your data from your HTML charset which may
> > be EUC-JP to your browser charset which is Shift_JIS.  The same code may
> > convert the same page to UTF-8 for the next browser.   If you are parsing
> > data your code can call xiua_strtok and the same call will work for UTF-32,
> > UTF-16, UTF-8 and code page data.  Unlike the normal strtok it is also
> > thread safe.
> >
> > It manages your locale information including time zones using Java style
> > time zones.  It also has special web functions.  For example it will analyze
> > a browser accept language string including the q= quality selections and
> > return the first choice language based on the installed ICU locales.  It
> > will also analyze a path and return any RFC 3066 language subdirectory name
> > that is found to match your ICU installed locales.
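The Accept-Language selection described above can be sketched in plain C. This is only an illustration of the general technique, not xIUA's code: the function name pick_language and the caller-supplied installed[] list (standing in for the installed ICU locales) are assumptions, and wildcard ("*") ranges are simply ignored.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Case-insensitive test: do the first n bytes of a equal the whole string b? */
static int tag_equals(const char *a, size_t n, const char *b)
{
    size_t i;
    if (strlen(b) != n) return 0;
    for (i = 0; i < n; i++)
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 0;
    return 1;
}

/* Hypothetical helper: returns the index in installed[] of the acceptable
 * language with the highest q value, or -1 if nothing matches.
 * Ties go to the entry listed first in the header; q=0 is never chosen.
 * Note: strtod follows the C locale's decimal point. */
int pick_language(const char *header, const char *const *installed, size_t count)
{
    int best = -1;
    double best_q = 0.0;
    const char *p = header;

    while (*p) {
        const char *tag, *end;
        size_t tag_len, primary_len, i;
        int match = -1;
        double q = 1.0;                      /* default quality when q= is absent */

        while (*p == ' ' || *p == '\t' || *p == ',') p++;
        if (!*p) break;

        tag = p;                             /* language tag, e.g. "fr-CA" */
        while (*p && *p != ',' && *p != ';' && *p != ' ' && *p != '\t') p++;
        tag_len = (size_t)(p - tag);

        end = p;                             /* find the end of this list item */
        while (*end && *end != ',') end++;
        for (; p < end; p++) {               /* look for an optional ";q=value" */
            if ((*p == 'q' || *p == 'Q') && p + 1 < end && p[1] == '=') {
                q = strtod(p + 2, NULL);
                break;
            }
        }
        p = end;

        /* Prefer an exact match, then fall back to the primary subtag. */
        for (i = 0; i < count && match < 0; i++)
            if (tag_equals(tag, tag_len, installed[i])) match = (int)i;
        for (primary_len = 0; primary_len < tag_len && tag[primary_len] != '-';)
            primary_len++;
        for (i = 0; i < count && match < 0; i++)
            if (tag_equals(tag, primary_len, installed[i])) match = (int)i;

        if (match >= 0 && q > best_q) {
            best = match;
            best_q = q;
        }
    }
    return best;
}
```

With installed[] = { "en", "ja", "fr" }, the header "da, ja;q=0.8, en;q=0.7" selects "ja", because "da" is not installed and "ja" carries the higher q value.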
> >
> > It also has special migration aids.  For example, it has a routine that will
> > convert a strftime date time format to an ICU format using the ICU values
> > from its resource bundles.
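To make the strftime-to-ICU idea concrete, here is a minimal sketch that maps a few fixed strftime fields onto ICU date pattern letters. The names strftime_to_icu and map_spec are hypothetical, and this version does not consult ICU resource bundles at all, so locale-dependent specifiers such as %x or %c are rejected rather than translated.

```c
#include <stddef.h>
#include <string.h>

/* Append s to out; returns 0 on success, -1 if it would not fit. */
static int append(char *out, size_t out_size, size_t *len, const char *s)
{
    size_t n = strlen(s);
    if (*len + n + 1 > out_size) return -1;
    memcpy(out + *len, s, n + 1);            /* copies the terminator too */
    *len += n;
    return 0;
}

/* Fixed (locale-independent) strftime fields and their ICU pattern letters. */
static const char *map_spec(char c)
{
    switch (c) {
    case 'Y': return "yyyy";                 /* four-digit year */
    case 'y': return "yy";                   /* two-digit year */
    case 'm': return "MM";                   /* month 01-12 */
    case 'd': return "dd";                   /* day of month 01-31 */
    case 'H': return "HH";                   /* hour 00-23 */
    case 'M': return "mm";                   /* minute */
    case 'S': return "ss";                   /* second */
    case 'B': return "MMMM";                 /* full month name */
    case 'A': return "EEEE";                 /* full weekday name */
    default:  return NULL;                   /* not handled in this sketch */
    }
}

/* Hypothetical converter: returns 0 on success, -1 on an unsupported
 * specifier or a too-small buffer.  ASCII letters in literal text are
 * wrapped in single quotes, as ICU patterns require. */
int strftime_to_icu(const char *fmt, char *out, size_t out_size)
{
    size_t len = 0;
    int in_quote = 0;
    char lit[3];

    if (out_size == 0) return -1;
    out[0] = '\0';

    for (; *fmt; fmt++) {
        char c;
        if (*fmt == '%' && fmt[1] != '\0' && fmt[1] != '%') {
            const char *pat = map_spec(fmt[1]);
            if (pat == NULL) return -1;
            if (in_quote && append(out, out_size, &len, "'") != 0) return -1;
            in_quote = 0;
            if (append(out, out_size, &len, pat) != 0) return -1;
            fmt++;
            continue;
        }
        if (*fmt == '%' && fmt[1] == '%') fmt++;   /* "%%" is a literal '%' */
        c = *fmt;
        if (((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')) && !in_quote) {
            if (append(out, out_size, &len, "'") != 0) return -1;
            in_quote = 1;
        }
        if (c == '\'') {                     /* an apostrophe is doubled */
            lit[0] = '\''; lit[1] = '\''; lit[2] = '\0';
        } else {
            lit[0] = c; lit[1] = '\0';
        }
        if (append(out, out_size, &len, lit) != 0) return -1;
    }
    if (in_quote && append(out, out_size, &len, "'") != 0) return -1;
    return 0;
}
```

For instance, "%Y-%m-%d" becomes "yyyy-MM-dd", and the literal word in "%d at %H" is quoted so that ICU does not read its letters as pattern fields.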
> >
> > It also makes conversion easier because like Java you don't have to pass the
> > locale to every function that may invoke ICU so that you don't have to
> > change any existing APIs to convert to Unicode.
>
> Great.  Keep up the good work.  For those writing in Java (apparently a majority
> of coders worldwide, according to a recent study), you may have to write this
> stuff on your own, or find a compatible Java library.  I wouldn't say that you
> had to scrap all your Java code, though.
>
> >
> > This code is really a starting point for users.  It is designed to be
> > customized by users.  It also has alternative functions.  For example I have
> > xiua_strcoll that most C programmers can relate to but since the first
> > implementation was in a special version of PHP, I also have:
> >
> > int32_t          /* 1 = TRUE, 0 = FALSE, -1 = LOGIC ERROR */
> >  xiua_Collate(char *str1, /* string 1 */
> >   char * option, /* option string contains both comparison test */
> >                  /* and optional collation strength parameters */
> >                  /* "==" "<=" ">=" "!=" "<" ">" are the */
> >                  /* comparison test values and "?" ":" "#" are */
> >                  /* the valid strength codes.  "==?" is a test */
> >                  /* for equal primary strength. */
> >                  /* ? = Primary letters match no case or case */
> >                  /* e.g "Black-bird" ==? "blackbird"  */
> >                  /* but what constitute separate letters may differ */
> >                  /* by locale e.g. Spanish ch ll */
> >                  /* Secondary case insensitive normalized with accents */
> >                  /* : = Tertiary above plus case sensitive */
> >                  /* # = Strict match */
> >                  /* spaces are ignored, non-standard conditions are */
> >                  /* supported "!<>" or "=" are the same as "==" */
> >                  /* "" or "!" however are illogical and are errors */
> >   char * str2);  /* string 2 */
> >
> > Because the result is a TRUE/FALSE it is easy to embed the result into a
> > more complex test or a regular expression.
> >
> > It also has i18n useful functions such as xiua_strncpyEx.  It works somewhat
> > like strncpy except that it always adds a null to the end of the target
> > string and only copies full characters.  So if it is copying UTF-8 or
> > Shift_JIS data you will always get full characters copied even if it means
> > that the target buffer is not quite full.  It also always adds a null to the
> > end which is 4 bytes if it is UTF-32 data.  To make it easier to use it
> > returns the data length copied.
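The "only copies full characters" behavior can be illustrated for the UTF-8 case with a short sketch. The function name utf8_copy_whole is hypothetical and this is not the xiua_strncpyEx implementation; it assumes well-formed UTF-8 input and does not cover Shift_JIS or UTF-32.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy src into dst (capacity dst_size bytes) without
 * ever splitting a UTF-8 character.  Always null-terminates when
 * dst_size > 0 and returns the number of bytes copied, excluding the
 * terminator. */
size_t utf8_copy_whole(char *dst, size_t dst_size, const char *src)
{
    size_t n, limit;

    if (dst_size == 0) return 0;
    limit = dst_size - 1;                    /* leave room for the terminator */
    n = strlen(src);

    if (n > limit) {
        n = limit;
        /* src[n] is the first byte that will NOT be copied.  If it is a
         * continuation byte (10xxxxxx), the cut falls inside a character,
         * so back up until the cut sits just before a lead byte. */
        while (n > 0 && ((unsigned char)src[n] & 0xC0) == 0x80)
            n--;
    }

    memcpy(dst, src, n);
    dst[n] = '\0';
    return n;
}
```

Copying the three-byte string "a" followed by U+00E9 into a three-byte buffer therefore yields just "a" and a return value of 1: the two-byte character does not fit whole, so the buffer is left not quite full.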
> >
> > >
> > > Also, if you're generating HTML forms, I recommend you take a look at:
> > >
> > > http://www.unicode.org/iuc/iuc17/papers.html
> >
> > Good presentation.
>
> Born of 2 painful projects, one in Java, one in C.  Somewhat out of date but the
> same considerations still hold.
>
> > One thing that my software does is help overcome some of
> > the problems with internationalization that you have with the Apache web
> > server that you don't have with servers like iPlanet but yet stay iPlanet
> > compatible.
> >
> > I think that adding language as a type is an administrative nightmare.
> > www.mysite.com/dir/subdir/mypage.html.en is a bad idea.  If nothing else it
> > creates problems keeping track of links.  It is also a maintenance
> > nightmare if I also have www.mysite.com/dir/subdir/mypage.html.jp, which is
> > encoded in sjis, mixed in with pages in other encodings.  It is easy to get
> > your hands crossed.
>
> Hmm, I never remember saying that.  I assume you're just adding this as advice
> to, er, was it Misha?
>
> Andrea
>
> >
> > >
> > > Look at the presentation under Session A2 (mine ;-) and it looks like the
> > > presentation under A3 might have some useful information (David Taieb's).
> > >
> > > Regards,
> > > Andrea
> > >
> >
> > Carl
Received on Friday, 17 August 2001 19:27:43 UTC