RE: utf-8 Locale support on Solaris and Linux

Paul

> -----Original Message-----
> From: Paul Deuter [mailto:Paul.Deuter@plumtree.com]
> Sent: Friday, September 28, 2001 9:57 AM
> To: Richard, Francois M; www-international@w3.org
> Cc: Carl W. Brown
> Subject: RE: utf-8 Locale support on Solaris and Linux
>
>
> No one can make tradeoff judgments for you, so I won't even try.
>
> However there are some facts which you should know:
>
> 1.  ICU is C/C++ open source code and therefore should work on any
> system.

ICU contains specific support to make it platform independent.  It has
support for most Unix platforms, Windows, Mac, AS400, S390 etc.  I think
what you mean is that since it is open source you can also adapt it to new
platforms.  That is true, and you can contribute your changes back to the base
code so that future releases will contain the special platform support.


>
> 2.  UTF-8 is a MBCS where each character can be composed of 1-4 octets.
> Therefore you do not use wide characters with UTF-8.  (Note: if you use
> UTF-8, you should learn it.  It only takes a few minutes to understand
> the encoding - it is very simple and quite beautiful too.  There are
> lots of references on the web.)

Hot off the presses is Ken Lunde's article on UTF-8.  It includes the
changes made to the standard in Unicode 3.1.
http://www-106.ibm.com/developerworks/unicode/library/u-encode.html
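
To give a feel for the 1-4 octet layout, here is a minimal sketch in C of
encoding one code point as UTF-8.  The name utf8_encode is made up for this
illustration and is not an ICU function.  It assumes the present-day rules
(surrogates and values above U+10FFFF are rejected).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of ICU.
 * Encodes one code point as UTF-8 into out (room for 4 octets).
 * Returns the number of octets written (1-4), or 0 if cp cannot be
 * encoded (a surrogate, or a value above U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx then three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* beyond the Unicode range */
}
```

The lead octet alone tells you how long the sequence is, and every trailing
octet has the form 10xxxxxx, which is why the encoding is easy to scan in
either direction.
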
>
> 3.  Most ICU interfaces do not take UTF-8 strings but rather UTF-16
> strings (which are 16-bits wide).

ICU has some macros for UTF-8 support but you have to look at them
carefully.  They were added to ICU because they do not add to the code size.
They are not a complete UTF-8 support package.  There are some that I use
but others can get you into trouble.  We just had a discussion on the use of
such macros to count the number of characters in a string.  There are two
classes of support macros: SAFE and UNSAFE.  The SAFE macros validate the
data; the UNSAFE macros, which run faster, do not.  Using either class in a
counting routine will produce a bad count if the data is bad.  The count may
differ depending on the choice, but neither will give you any indication that
the count is wrong.

In my humble opinion, you are better off implementing your own routines for
many of these functions.  They can be faster and more reliable.  This is
the one area where I feel that ICU would have been better off not attempting
the job at all rather than doing it halfway.  In all other areas the ICU code
is top notch.
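
As an illustration of what such a routine could look like, here is a minimal
sketch in C of a character count that reports bad data instead of silently
returning a wrong number.  The name utf8_count and its signature are
assumptions made for this example; it is not ICU code.  It applies the
present-day well-formedness rules (no overlong forms, no surrogates, nothing
above U+10FFFF).

```c
#include <stddef.h>

/* Hypothetical helper, not part of ICU.
 * Counts the code points in s[0..len).  On success returns 0 and stores
 * the count in *count.  On malformed input returns -1 and stores the
 * offset of the offending sequence in *bad_offset. */
int utf8_count(const unsigned char *s, size_t len,
               size_t *count, size_t *bad_offset)
{
    size_t i = 0, n = 0;

    while (i < len) {
        unsigned char c = s[i];
        unsigned char lo = 0x80, hi = 0xBF;   /* allowed range of 2nd octet */
        size_t need, k;

        if (c < 0x80) {
            need = 0;
        } else if (c >= 0xC2 && c <= 0xDF) {
            need = 1;
        } else if (c >= 0xE0 && c <= 0xEF) {
            need = 2;
            if (c == 0xE0) lo = 0xA0;         /* rejects overlong forms */
            if (c == 0xED) hi = 0x9F;         /* rejects surrogates */
        } else if (c >= 0xF0 && c <= 0xF4) {
            need = 3;
            if (c == 0xF0) lo = 0x90;         /* rejects overlong forms */
            if (c == 0xF4) hi = 0x8F;         /* rejects > U+10FFFF */
        } else {
            *bad_offset = i;                  /* stray trail or invalid lead */
            return -1;
        }

        if (need > len - i - 1) {             /* sequence is cut short */
            *bad_offset = i;
            return -1;
        }
        for (k = 1; k <= need; k++) {
            unsigned char t = s[i + k];
            unsigned char l = (k == 1) ? lo : 0x80;
            unsigned char h = (k == 1) ? hi : 0xBF;
            if (t < l || t > h) {
                *bad_offset = i;
                return -1;
            }
        }
        i += need + 1;
        n++;
    }
    *count = n;
    return 0;
}
```

The point of the separate return code is that the caller can tell a valid
count from a failed one, which neither macro class gives you.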

>
> 4.  Internationalization engineers spend their lives retrofitting old
> code and wish that more concern for this effort had been considered
> during initial design.  If you are planning on migrating your software
> to other platforms such as Solaris (as you mention) - then using a cross
> platform approach (such as ICU) could give long term benefits in
> addition to the short term benefit of knowing that your Unicode strings
> are being processed properly.
>

You are so right.  In addition, the job of globalization must be a part of
every developer's work.  To be successful, good i18n engineers must provide
solutions that are comfortable for all developers in the shop.

Take the case of xiua_strcoll.  It first checks what encoding the data you
are collating is in; if it is UTF-8 it invokes xiu8_strcoll.  (You can call
xiu8_strcoll directly if the data will always be in UTF-8.)  This routine
first determines how much work memory it needs and whether the scratch
working stack is large enough.  If so, it then transforms the two strings to
UTF-16.  Then it invokes xiu2_strcoll.  This routine opens an ICU collator.
It sets the collator normalization to canonical decomposition followed by
canonical composition.  Then I set it to non-ignorable rather than shifted
character handling, turn case off, and set the collation strength to
tertiary.  Then I issue the collate and close the collator.
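
The front half of that flow (size the work area, transform both strings to
UTF-16, hand them on) can be sketched roughly as below.  This is not the
xIUA source: every name here (SCRATCH_UNITS, u8_to_u16, u16_compare,
sketch_strcoll_u8) is made up for illustration, and failing when the
scratch area is too small is an assumption.  The final step is a plain
code-unit comparison standing in for the ICU collator call, so that the
sketch builds without the ICU library; it does not give linguistic order.

```c
#include <stddef.h>
#include <stdint.h>

#define SCRATCH_UNITS 256   /* hypothetical size of the on-stack work area */

/* Converts UTF-8 to UTF-16.  Returns the number of 16-bit units written,
 * or -1 if the input is malformed or dst is too small. */
long u8_to_u16(const unsigned char *s, size_t len, uint16_t *dst, size_t cap)
{
    size_t i = 0, o = 0;

    while (i < len) {
        uint32_t cp = s[i], min;
        size_t need, k;

        if (cp < 0x80)                { need = 0; min = 0; }
        else if ((cp & 0xE0) == 0xC0) { need = 1; min = 0x80;    cp &= 0x1F; }
        else if ((cp & 0xF0) == 0xE0) { need = 2; min = 0x800;   cp &= 0x0F; }
        else if ((cp & 0xF8) == 0xF0) { need = 3; min = 0x10000; cp &= 0x07; }
        else return -1;                        /* invalid lead octet */

        if (need > len - i - 1) return -1;     /* sequence is cut short */
        for (k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return -1;
            cp = (cp << 6) | (uint32_t)(s[i + k] & 0x3F);
        }
        /* Reject overlong forms, surrogates, and out-of-range values. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return -1;

        if (cp < 0x10000) {
            if (o + 1 > cap) return -1;
            dst[o++] = (uint16_t)cp;
        } else {                               /* needs a surrogate pair */
            if (o + 2 > cap) return -1;
            cp -= 0x10000;
            dst[o++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[o++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
        i += need + 1;
    }
    return (long)o;
}

/* Placeholder for the collator step: code-unit order, NOT collation. */
int u16_compare(const uint16_t *a, size_t na, const uint16_t *b, size_t nb)
{
    size_t i, n = na < nb ? na : nb;

    for (i = 0; i < n; i++)
        if (a[i] != b[i]) return a[i] < b[i] ? -1 : 1;
    return (na > nb) - (na < nb);
}

/* Returns 0 and stores -1, 0 or 1 in *result; returns -1 if the input is
 * malformed or does not fit the scratch area. */
int sketch_strcoll_u8(const char *a, size_t la,
                      const char *b, size_t lb, int *result)
{
    uint16_t work[SCRATCH_UNITS];
    long na, nb;

    /* A UTF-8 octet never yields more than one UTF-16 unit, so la + lb
     * units is a safe upper bound on the work memory needed. */
    if (la > SCRATCH_UNITS || lb > SCRATCH_UNITS - la) return -1;

    na = u8_to_u16((const unsigned char *)a, la, work, la);
    if (na < 0) return -1;
    nb = u8_to_u16((const unsigned char *)b, lb, work + la, lb);
    if (nb < 0) return -1;

    *result = u16_compare(work, (size_t)na, work + la, (size_t)nb);
    return 0;
}
```

In a real build the u16_compare call is where the collator would be opened,
configured, used and closed; the caller of sketch_strcoll_u8 sees none of
that.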

I don't expect the average programmer to have to understand normalization or
collation principles.  They just want to compare using a linguistically
correct collation for that locale.  This is why I wrote xIUA: it gives people
a starter interface system that they can use or get ideas from.  It assumes
that not every programmer will be an i18n guru.

Received on Friday, 28 September 2001 15:16:37 UTC