RE: utf-8 Locale support on Solaris and Linux from Carl W. Brown on 2001-09-28 (www-international@w3.org from July to September 2001)

From: Carl W. Brown <cbrown@xnetinc.com>
Date: Fri, 28 Sep 2001 15:02:26 -0700
To: <www-international@w3.org>
Message-ID: <FNEHIHOMIIDPDCIFEJEGCENBCJAA.cbrown@xnetinc.com>
Merle,

> -----Original Message-----
> From: www-international-request@w3.org
> [mailto:www-international-request@w3.org]On Behalf Of Merle Tenney
> Sent: Friday, September 28, 2001 1:21 PM
> To: 'Carl W. Brown'; www-international@w3.org
> Subject: RE: utf-8 Locale support on Solaris and Linux
>
>
> Carl,
>
> > > 3.  Most ICU interfaces do not take UTF-8 strings but rather UTF-16
> > > strings (which are 16-bits wide).
> >
> > ICU has some macros for UTF-8 support but you have to look at them
> > carefully.  They were added to ICU because they do not add to the code
> > size.  They are not a complete UTF-8 support package.  There are some
> > that I use but others can get you into trouble.  We just had a
> > discussion on the use of such macros to count the number of characters
> > in a string.  There are two classes of support macros SAFE and UNSAFE.
> > The SAFE validate the data and the UNSAFE which run faster do not.
> > Using either macro in a routine will produce a bad count if the data is
> > bad.  The count may differ depending on the choice but neither will
> > give you any indication that the count is wrong.
> >
> > In my humble opinion, you are better off implementing your own routines
> > for many of these functions.  They can be faster and be more reliable.
> > This is the one area they I feel that ICU would have been better off in
> > just not trying to do a half done job.  In all other areas the ICU code
> > is top notch.
>
> This is good advice, but it leads naturally to another question:  Why
> doesn't ICU have a branch that provides equivalent support to the existing
> code, but for text encoded in UTF-8?  I know that you can convert easily
> between UTF-8 and UTF-16, but you really want to have a system that is
> designed, optimized, and tested for your native encoding.

I believe that for something like collation it is more efficient to convert
the UTF-8 into UTF-16 or UTF-32, because of the intense character handling.
The same applies to case shifting, but because of the double transformation
and the lighter character handling I suspect that it is probably closer to a
draw.  Other routines like strstr and strtok must have separate
implementations for UTF-8, UTF-16 and UTF-32.

I use a fast, stateless transform that stays efficient even with lots of
little fields, so the overhead is minimal.  It is also totally transparent
to the user.
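
To give an idea of what such a transform involves, here is a minimal sketch
in C.  It is not the xIUA code and not an ICU API; the function name
utf8_to_utf16 and its signature are made up for illustration.  It converts
in one pass, keeps no state between calls, and rejects malformed input
(overlong forms, surrogates, truncated sequences) instead of quietly
producing a wrong result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of ICU or xIUA): convert a UTF-8 buffer to
 * UTF-16 in a single stateless pass.  Returns the number of UTF-16 code
 * units written, or -1 if the input is malformed or dst is too small. */
long utf8_to_utf16(const unsigned char *src, size_t src_len,
                   uint16_t *dst, size_t dst_cap)
{
    size_t i = 0, o = 0;

    assert(src != NULL && dst != NULL);
    while (i < src_len) {
        unsigned char b = src[i];
        uint32_t cp;    /* code point being assembled */
        uint32_t min;   /* smallest code point legal for this length */
        size_t need;    /* number of trail bytes expected */

        if (b < 0x80)                    { cp = b;        need = 0; min = 0; }
        else if (b >= 0xC2 && b <= 0xDF) { cp = b & 0x1F; need = 1; min = 0x80; }
        else if ((b & 0xF0) == 0xE0)     { cp = b & 0x0F; need = 2; min = 0x800; }
        else if (b >= 0xF0 && b <= 0xF4) { cp = b & 0x07; need = 3; min = 0x10000; }
        else return -1;                  /* stray trail byte or illegal lead */

        if (src_len - i - 1 < need) return -1;      /* truncated sequence */
        i++;
        for (size_t k = 0; k < need; k++, i++) {
            if ((src[i] & 0xC0) != 0x80) return -1; /* not a trail byte */
            cp = (cp << 6) | (src[i] & 0x3F);
        }
        /* Reject overlong forms, surrogates and values past U+10FFFF. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return -1;

        if (cp < 0x10000) {
            if (o + 1 > dst_cap) return -1;
            dst[o++] = (uint16_t)cp;
        } else {                                    /* needs a surrogate pair */
            if (o + 2 > dst_cap) return -1;
            cp -= 0x10000;
            dst[o++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[o++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    return (long)o;
}
```

Because the function reports an error rather than guessing, the caller
always knows whether the result can be trusted, which is exactly what a
bare character-counting macro cannot tell you.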

> There are a *lot*
> of Unicode implementations that will be based on UTF-8, so I don't think
> this is an unusual request.  Has this been considered before?

I started on the original code that eventually became xIUA in Feb 2000.  One
of the first things I added was UTF-8 support for ICU.


> Would it take
> a lot of work to complement the existing ICU libraries with native UTF-8
> versions and maintain them in parallel?

Seeing the growing interest in UTF-8 was one of the factors that led me to
make xIUA available to the public as a cross-platform, free, open source
addition to ICU.  You can use just the UTF-8 support out of it, but I found
that web applications were especially tricky in that they could use UTF-8
both as a Unicode encoding and as a code page.  To make that easy for
programmers it needed a little extra infrastructure.

For example, my browser may use any character set, from one of many code
pages to UTF-8.  I need an application that has transparent code page (MBCS)
support as well as UTF-8 support.  So I can use iso-8859-1 to browse a site,
and if the data contains untranslatable characters such as the Euro
character, they will automatically be inserted in the text as NCRs.  The
same code will see that if the browser is using a UTF-8 code page and the
database is UTF-8, then it just has to copy the data.
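
The NCR fallback can be sketched roughly as follows.  Again this is an
illustration and not the xIUA code: to_latin1_with_ncr is a made-up name,
and it assumes the text has already been decoded to code points.  Anything
that fits in iso-8859-1 is copied as a single byte; anything else, such as
the Euro sign U+20AC, goes out as a decimal numeric character reference.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: encode code points as ISO-8859-1, writing a decimal
 * NCR ("&#8364;") for anything above U+00FF.  The output is NUL terminated.
 * Returns the number of bytes written (excluding the NUL), or -1 if dst is
 * too small. */
long to_latin1_with_ncr(const uint32_t *cps, size_t n,
                        char *dst, size_t dst_cap)
{
    size_t o = 0;

    assert(cps != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        if (cps[i] <= 0xFF) {
            /* Directly representable: one byte, leave room for the NUL. */
            if (o + 1 >= dst_cap) return -1;
            dst[o++] = (char)(unsigned char)cps[i];
        } else {
            /* Not representable in the code page: emit an NCR instead. */
            int w = snprintf(dst + o, dst_cap - o, "&#%lu;",
                             (unsigned long)cps[i]);
            if (w < 0 || (size_t)w >= dst_cap - o) return -1;
            o += (size_t)w;
        }
    }
    if (o >= dst_cap) return -1;
    dst[o] = '\0';
    return (long)o;
}
```

When both the browser and the database are UTF-8, none of this is needed
and the data can be copied as is; a real application would also have to
escape markup characters such as "&" and "<", which this sketch leaves out.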

I also throw in code that will check the browser's accept language, a
routine to convert it to a locale, and one to take the accept char set and
pick the best character set to build a full locale for the browser.  Code
like this needs to be customized to the application; in fact, xIUA is
designed to be integrated into the application.
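
As a rough idea of the accept-language step, here is a small sketch in C.
It is not the xIUA routine; accept_language_to_locale is a made-up name and
the handling is deliberately simple.  It picks the entry with the highest q
value from an Accept-Language header and rewrites a language-region tag in
the usual locale spelling, so "fr-ca" becomes "fr_CA".  Longer tags with
script subtags, and the character set part of the locale, are not handled.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: choose the highest-weighted language from an
 * Accept-Language header and write it as a locale name.  Returns 0 on
 * success, -1 if no usable entry was found. */
int accept_language_to_locale(const char *header, char *out, size_t out_cap)
{
    double best_q = 0.0;   /* entries with q=0 are "not acceptable" */
    int found = 0;
    const char *p = header;

    assert(header != NULL && out != NULL);
    while (*p != '\0') {
        const char *end = p + strcspn(p, ",");   /* end of this list item */
        const char *semi = memchr(p, ';', (size_t)(end - p));
        const char *tag_end = semi ? semi : end;
        double q = 1.0;                          /* default weight */
        size_t len;

        /* Trim whitespace around the language tag. */
        while (p < tag_end && isspace((unsigned char)*p)) p++;
        while (tag_end > p && isspace((unsigned char)tag_end[-1])) tag_end--;

        if (semi != NULL) {                      /* optional ";q=0.8" */
            const char *qp = semi + 1;
            while (qp < end && isspace((unsigned char)*qp)) qp++;
            if (end - qp >= 2 && qp[0] == 'q' && qp[1] == '=')
                q = strtod(qp + 2, NULL);
        }

        len = (size_t)(tag_end - p);
        if (len > 0 && *p != '*' && q > best_q && len < out_cap) {
            int region = 0;
            for (size_t i = 0; i < len; i++) {
                unsigned char c = (unsigned char)p[i];
                if (c == '-') { out[i] = '_'; region = 1; }
                else out[i] = (char)(region ? toupper(c) : tolower(c));
            }
            out[len] = '\0';
            best_q = q;
            found = 1;
        }
        p = (*end == ',') ? end + 1 : end;       /* move to the next item */
    }
    return found ? 0 : -1;
}
```

On ties the first entry wins, which matches the order the browser sent.
This is the kind of policy that needs to be adjusted per application.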

Your next point was how do you maintain this?  In a year and a half I have
only made minor changes.  I tightened up the UTF-8 validation to meet
Unicode 3.1 changes and I had to change the collation interface because of
changes in ICU between 1.6 and 1.8.  The other change I made was I changed
the string null termination handling to be more compatible with the new ICU
2.0 standards.  In a year and a half there were only a few lines of code
that needed to be changed.  The other changes were improvements.  By running
the xIUA test program with a new ICU release I can find out very quickly if
I need to change anything and exactly what needs changing.

Carl
Received on Friday, 28 September 2001 18:02:31 UTC