RE: utf-8 Locale support on Solaris and Linux

Carl,

> > 3.  Most ICU interfaces do not take UTF-8 strings but rather UTF-16
> > strings (which are 16 bits wide).
> 
> ICU has some macros for UTF-8 support, but you have to look at them
> carefully.  They were added to ICU because they do not add to the code
> size.  They are not a complete UTF-8 support package.  There are some
> that I use, but others can get you into trouble.  We just had a
> discussion on the use of such macros to count the number of characters
> in a string.  There are two classes of support macros: SAFE and
> UNSAFE.  The SAFE macros validate the data; the UNSAFE macros, which
> run faster, do not.  Using either macro in such a routine will produce
> a bad count if the data is bad.  The count may differ depending on the
> choice, but neither will give you any indication that the count is
> wrong.
> 
> In my humble opinion, you are better off implementing your own
> routines for many of these functions.  They can be faster and more
> reliable.  This is the one area where I feel ICU would have been
> better off not trying to do a half-done job.  In all other areas the
> ICU code is top-notch.
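
To make the counting pitfall concrete, here is a minimal sketch of the
problem.  It assumes the U8_NEXT / U8_NEXT_UNSAFE macro names from
ICU's unicode/utf8.h; the macro names have varied across ICU releases,
so check your headers:

    #include <unicode/utf8.h>

    /* Count code points with the SAFE macro.  On ill-formed input,
     * U8_NEXT sets c to a negative value, but unless the caller checks
     * for that, the loop still returns a "count" of something. */
    static int32_t countSafe(const uint8_t *s, int32_t length) {
        int32_t i = 0, count = 0;
        UChar32 c;
        while (i < length) {
            U8_NEXT(s, i, length, c);   /* c < 0 marks bad data... */
            ++count;                    /* ...but we count it anyway */
        }
        return count;
    }

    /* The same loop with the UNSAFE macro: faster, no validation at
     * all, and on bad data it may produce a different (also wrong)
     * count. */
    static int32_t countUnsafe(const uint8_t *s, int32_t length) {
        int32_t i = 0, count = 0;
        UChar32 c;
        while (i < length) {
            U8_NEXT_UNSAFE(s, i, c);
            ++count;
        }
        return count;
    }

Neither routine reports the error by itself, which is exactly your
point about validating first or writing your own routines.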

This is good advice, but it leads naturally to another question:  Why
doesn't ICU have a branch that provides support equivalent to the
existing code, but for text encoded in UTF-8?  I know that you can
convert easily between UTF-8 and UTF-16, but you really want a system
that is designed, optimized, and tested for your native encoding.
There are a *lot* of Unicode implementations that will be based on
UTF-8, so I don't think this is an unusual request.  Has this been
considered before?  Would it take a lot of work to complement the
existing ICU libraries with native UTF-8 versions and maintain them in
parallel?
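
For what it is worth, the UTF-8 to UTF-16 hop itself really is a
one-call affair with the converter API.  A rough, untested sketch
(buffer size and error handling abbreviated for illustration):

    #include <stdio.h>
    #include <unicode/ucnv.h>

    int main(void) {
        const char *utf8 = "caf\xC3\xA9";    /* "café" encoded as UTF-8 */
        UChar utf16[256];                     /* UTF-16 working buffer */
        UErrorCode status = U_ZERO_ERROR;

        UConverter *cnv = ucnv_open("UTF-8", &status);
        int32_t len = ucnv_toUChars(cnv, utf16, 256, utf8, -1, &status);
        ucnv_close(cnv);

        if (U_FAILURE(status)) {
            fprintf(stderr, "conversion failed: %s\n", u_errorName(status));
            return 1;
        }
        printf("converted %d UTF-16 code units\n", (int)len);
        return 0;
    }

Doing that round trip (and the reverse) around every ICU operation is
exactly the overhead a native UTF-8 code path would avoid.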

Merle

Received on Friday, 28 September 2001 16:22:06 UTC