Re: Unicode <-> CJKV national encoding; supporting multi-lingual webcontent

Carl,
Interesting response to a suggestion that is neither unreasonable nor far-fetched.
Comments embedded:

"Carl W. Brown" wrote:
> 
> Andrea,
> 
> > -----Original Message-----
> > From: www-international-request@w3.org
> > [mailto:www-international-request@w3.org]On Behalf Of A. Vine
> > Sent: Wednesday, August 15, 2001 10:40 AM
> > To: www-international@w3.org
> > Subject: Re: Unicode <-> CJKV national encoding; supporting
> > multi-lingual webcontent
> >
> >
> >
> > "Carl W. Brown" wrote:
> > >
> > > The only way to sanely implement a multi-lingual site is using
> > Unicode.  The
> > > best support for Unicode is ICU.
> > http://oss.software.ibm.com/icu/  If you
> >
> > or Java :-)
> 
> Java has its own set of problems (challenges).  First, many people already
> have code written in C that they do not want to rewrite.

No mention was made of whether the code was in C or anything else.  I was simply
suggesting Java.

>  It is not as easy
> to actually get to the underlying Unicode in Java.

Please give an example.

>  C code usually runs
> faster.  Then there is the problem of JVM versions and conflicting support.

Conflicting support?

> 
> ICU started with the Java Unicode support and adapted it for C/C++
> applications.  http://oss.software.ibm.com/icu/ You will notice that many of
> the functions are very Javaesque.

Yup, we here use some of the older code written by Netscape and Taligent before
ICU was created as a C/C++ parallel to the Java functionality.  Of course, there
have been updates.

> 
> It is a great component library for Unicode.  What I have added are extra
> functions that don't really belong in ICU proper.  xIUA, unlike ICU, is
> designed as a sample starting point for code that you develop as part of
> your application.  http://www.xnetinc.com/xiua/  While it is designed for
> typical applications, it is especially useful for web server applications.
> For example, it adds support for a per-thread set of locales.  You can have
> one locale for the browser, one for your HTML pages and one for your Unicode
> database.
> 
> You can make calls to transform your data from your HTML charset which may
> be EUC-JP to your browser charset which is Shift_JIS.  The same code may
> convert the same page to UTF-8 for the next browser.   If you are parsing
> data your code can call xiua_strtok and the same call will work for UTF-32,
> UTF-16, UTF-8 and code page data.  Unlike the normal strtok it is also
> thread safe.
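
For anyone wondering why plain strtok is not thread safe: it keeps its position in hidden static state. A minimal sketch of the reentrant alternative follows. It is not xIUA code; tok_reentrant is a hypothetical name, and the sketch is byte-oriented, so it only illustrates the thread-safety half of what xiua_strtok is described as doing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of xIUA or ICU.
 * Splits a byte string on any byte in `delims`, keeping its position in
 * *save instead of in hidden static state, so separate threads can
 * tokenize separate strings at the same time.
 * Safe for UTF-8 when every delimiter is ASCII, because UTF-8 never uses
 * bytes below 0x80 inside a multi-byte character.  Not safe for Shift_JIS,
 * whose trail bytes can fall in the ASCII range. */
char *tok_reentrant(char *str, const char *delims, char **save)
{
    assert(delims != NULL && save != NULL);

    char *s = (str != NULL) ? str : *save;
    if (s == NULL)
        return NULL;

    s += strspn(s, delims);              /* skip leading delimiters */
    if (*s == '\0') {
        *save = NULL;                    /* nothing left to tokenize */
        return NULL;
    }

    char *end = s + strcspn(s, delims);  /* first delimiter after the token */
    if (*end != '\0') {
        *end = '\0';                     /* terminate the token in place */
        *save = end + 1;
    } else {
        *save = NULL;                    /* this was the last token */
    }
    return s;
}
```

Pass the buffer on the first call and NULL afterwards, with the same save pointer each time. The Shift_JIS caveat in the comment is the reason a charset-aware tokenizer has to know the encoding it is walking.
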
> 
> It manages your locale information including time zones using Java style
> time zones.  It also has special web functions.  For example, it will analyze
> a browser Accept-Language string, including the q= quality selections, and
> return the first-choice language based on the installed ICU locales.  It
> will also analyze a path and return any RFC 3066 language subdirectory name
> that is found to match your ICU installed locales.
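
To make the Accept-Language part concrete, here is a rough sketch of that kind of selection. It is not the xIUA routine: pick_language and the plain array of installed tags are assumptions for illustration, the "*" wildcard is not handled, and q values are read with strtod, which assumes the default "C" numeric locale.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Case-insensitive test: do the first n bytes of a equal all of b? */
static int tag_equals(const char *a, size_t n, const char *b)
{
    if (strlen(b) != n)
        return 0;
    for (size_t i = 0; i < n; i++) {
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 0;
    }
    return 1;
}

/* Hypothetical function.  Returns the index in installed[] of the
 * acceptable language with the highest q value, or -1 if none matches.
 * Ties go to the element listed first; q=0 means "not acceptable". */
int pick_language(const char *header, const char *const installed[],
                  size_t count)
{
    assert(header != NULL && installed != NULL);
    int best = -1;
    double best_q = 0.0;
    const char *p = header;

    while (*p != '\0') {
        /* One comma-separated element: tag [ ";" "q=" value ] */
        const char *item_end = p + strcspn(p, ",");

        while (p < item_end && isspace((unsigned char)*p))
            p++;

        size_t tag_len = 0;
        while (p + tag_len < item_end && p[tag_len] != ';' &&
               !isspace((unsigned char)p[tag_len]))
            tag_len++;

        double q = 1.0;                  /* default when no q= is given */
        const char *semi = memchr(p, ';', (size_t)(item_end - p));
        if (semi != NULL) {
            const char *s = semi + 1;
            while (s < item_end && isspace((unsigned char)*s))
                s++;
            if (item_end - s >= 2 &&
                tolower((unsigned char)s[0]) == 'q' && s[1] == '=')
                q = strtod(s + 2, NULL);
        }

        if (tag_len > 0 && q > best_q) {
            /* Length of the primary subtag, e.g. "en" in "en-GB". */
            size_t primary_len = 0;
            while (primary_len < tag_len && p[primary_len] != '-')
                primary_len++;

            int found = -1;
            /* Exact match first, then fall back to the primary subtag. */
            for (size_t i = 0; i < count && found < 0; i++)
                if (tag_equals(p, tag_len, installed[i]))
                    found = (int)i;
            for (size_t i = 0; i < count && found < 0; i++)
                if (tag_equals(p, primary_len, installed[i]))
                    found = (int)i;

            if (found >= 0) {
                best = found;
                best_q = q;
            }
        }

        p = item_end;
        if (*p == ',')
            p++;
    }
    return best;
}
```

With installed tags "en", "ja" and "fr", the header "da, ja;q=0.8, en-GB;q=0.9" selects "en": "da" is not installed, and "en-GB" outranks "ja" and falls back to its primary subtag.
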
> 
> It also has special migration aids.  It has a routine, for example, that will
> convert a strftime date-time format to an ICU format using the ICU values
> from its resource bundles.
> 
> It also makes conversion easier because, as in Java, you don't have to pass
> the locale to every function that may invoke ICU, so you don't have to
> change any existing APIs to convert to Unicode.


Great.  Keep up the good work.  For those writing in Java (apparently a majority
of coders worldwide, according to a recent study), you may have to write this
stuff on your own, or find a compatible Java library.  I wouldn't say that you
had to scrap all your Java code, though.

> 
> This code is really a starting point for users.  It is designed to be
> customized by users.  It also has alternative functions.  For example I have
> xiua_strcoll that most C programmers can relate to but since the first
> implementation was in a special version of PHP, I also have:
> 
> int32_t          /* 1 = TRUE, 0 = FALSE, -1 = LOGIC ERROR */
>  xiua_Collate(char *str1, /* string 1 */
>   char * option, /* option string contains both comparison test */
>                  /* and optional collation strength parameters */
>                  /* "==" "<=" ">=" "!=" "<" ">" are the */
>                  /* comparison test values and "?" ":" "#" are */
>                  /* the valid strength codes.  "==?" is a test */
>                  /* for equal primary strength. */
>                  /* ? = Primary: letters match, ignoring case and accents */
>                  /* e.g. "Black-bird" ==? "blackbird"  */
>                  /* but what constitutes separate letters may differ */
>                  /* by locale, e.g. Spanish ch ll */
>                  /* Secondary case insensitive normalized with accents */
>                  /* : = Tertiary above plus case sensitive */
>                  /* # = Strict match */
>                  /* spaces are ignored, non-standard conditions are */
>                  /* supported "!<>" or "=" are the same as "==" */
>                  /* "" or "!" however are illogical and are errors */
>   char * str2);  /* string 2 */
> 
> Because the result is a TRUE/FALSE it is easy to embed the result into a
> more complex test or a regular expression.
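
As a rough illustration of the option-string idea, the sketch below parses the comparison part and applies it to the result of the standard strcoll. It is not the xIUA implementation: collate_test is a hypothetical name, the strength codes "?" ":" "#" are accepted but ignored, and the comparison follows whatever locale the program has set with setlocale rather than an ICU collator.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sketch.  Returns 1 = TRUE, 0 = FALSE, -1 = LOGIC ERROR. */
int collate_test(const char *str1, const char *option, const char *str2)
{
    assert(str1 != NULL && option != NULL && str2 != NULL);
    int lt = 0, gt = 0, eq = 0, negate = 0;

    for (const char *p = option; *p != '\0'; p++) {
        switch (*p) {
        case ' ':                       /* spaces are ignored */
            break;
        case '<': lt = 1; break;
        case '>': gt = 1; break;
        case '=': eq = 1; break;
        case '!': negate = !negate; break;
        case '?': case ':': case '#':   /* strength codes: not used here */
            break;
        default:
            return -1;                  /* unknown character */
        }
    }
    if (!lt && !gt && !eq)
        return -1;                      /* "" and "!" have no comparison */

    int c = strcoll(str1, str2);        /* locale-aware ordering */
    int result = (lt && c < 0) || (gt && c > 0) || (eq && c == 0);
    return negate ? !result : result;
}
```

Under this parsing "=" behaves like "==", and "!<>" reduces to an equality test, which matches the non-standard conditions listed in the prototype's comments.
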
> 
> It also has useful i18n functions such as xiua_strncpyEx.  It works somewhat
> like strncpy except that it always adds a null to the end of the target
> string and only copies full characters.  So if it is copying UTF-8 or
> Shift_JIS data you will always get full characters copied, even if it means
> that the target buffer is not quite full.  The terminating null is 4 bytes
> if it is UTF-32 data.  To make it easier to use, it returns the data length
> copied.
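
The whole-character rule is easy to show for UTF-8. The following is a sketch only, not xiua_strncpyEx: utf8_copy_whole is a hypothetical name, it handles UTF-8 and nothing else, and it assumes the source is valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, UTF-8 only.
 * Copies as many whole characters from src as fit in dst, where dst_size
 * is the full buffer size including room for the terminator.  Always
 * null-terminates, and returns the number of bytes copied, not counting
 * the terminator. */
size_t utf8_copy_whole(char *dst, size_t dst_size, const char *src)
{
    assert(dst != NULL && src != NULL && dst_size > 0);

    size_t n = strlen(src);
    if (n > dst_size - 1)
        n = dst_size - 1;               /* leave room for the terminator */

    /* src[n] is the first byte that will NOT be copied.  If it is a
     * continuation byte (10xxxxxx), the cut falls inside a character,
     * so back up to that character's lead byte and drop it entirely. */
    while (n > 0 && ((unsigned char)src[n] & 0xC0) == 0x80)
        n--;

    memcpy(dst, src, n);
    dst[n] = '\0';
    return n;
}
```

Copying the 4-byte string "a\xC3\xA9z" into a 3-byte buffer yields "a" and returns 1, because the two-byte character does not fit whole. A Shift_JIS version cannot back up this way, since its trail bytes are not self-identifying, and has to scan forward from the start of the string instead.
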
> 
> >
> > Also, if you're generating HTML forms, I recommend you take a look at:
> >
> > http://www.unicode.org/iuc/iuc17/papers.html
> 
> Good presentation.  

Born of 2 painful projects, one in Java, one in C.  Somewhat out of date but the
same considerations still hold.

> One thing that my software does is help overcome some of
> the internationalization problems that you have with the Apache web
> server, but not with servers like iPlanet, while staying iPlanet
> compatible.
> 
> I think that adding language as a type is an administrative nightmare.
> www.mysite.com/dir/subdir/mypage.html.en is a bad idea.  If nothing else it
> creates problems keeping track of links.  It is also a maintenance
> nightmare if I also have www.mysite.com/dir/subdir/mypage.html.jp, which is
> encoded in sjis, mixed in with pages in other encodings.  It is easy to get
> your hands crossed.

Hmm, I don't remember ever saying that.  I assume you're just adding this as advice
to, er, was it Misha?

Andrea

> 
> >
> > Look at the presentation under Session A2 (mine ;-) and it looks like the
> > presentation under A3 might have some useful information (David Taieb's).
> >
> > Regards,
> > Andrea
> >
> 
> Carl

Received on Thursday, 16 August 2001 21:16:02 UTC