RE: utf-8 Locale support on Solaris and Linux from Paul Deuter on 2001-09-28 (www-international@w3.org from July to September 2001)

From: Paul Deuter <Paul.Deuter@plumtree.com>
Date: Fri, 28 Sep 2001 09:57:07 -0700
To: "Richard, Francois M" <Francois.M.Richard@usa.xerox.com>, <www-international@w3.org>
Cc: "Carl W. Brown" <cbrown@xnetinc.com>
Message-ID: <C7F00D7948B8E4468BB330152C6BA4E00AACE3@cstaex03.USIPLUMTREE.AD>
No one can make tradeoff judgments for you, so I won't even try.

However, there are some facts that you should know:

1.  ICU is C/C++ open source code and therefore should work on any
system.

2.  UTF-8 is an MBCS in which each character is composed of 1-4 octets.
Therefore you do not use wide characters with UTF-8.  (Note: if you use
UTF-8, you should learn it.  It only takes a few minutes to understand
the encoding - it is very simple and quite beautiful too.  There are
lots of references on the web.)

3.  Most ICU interfaces do not take UTF-8 strings but rather UTF-16
strings (whose code units are 16 bits wide).

4.  Internationalization engineers spend their lives retrofitting old
code and wish that more thought had been given to this effort during
initial design.  If you are planning on migrating your software to
other platforms such as Solaris (as you mention), then using a
cross-platform approach (such as ICU) could give long-term benefits in
addition to the short-term benefit of knowing that your Unicode strings
are being processed properly.
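
As a rough illustration of points 2 and 3, here is a minimal sketch in C
of how the 1-4 octet structure of UTF-8 can be read from the lead byte,
and how many 16-bit UTF-16 code units the same text would need. The
names utf8_seq_len and utf8_count are hypothetical helpers, not part of
ICU, and the validation is deliberately light (it does not reject
overlong 3- and 4-octet forms or encoded surrogates).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: length in octets of the UTF-8 sequence that starts
   with lead byte b, or 0 if b cannot start a sequence. */
size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;                /* 0xxxxxxx: ASCII */
    if (b >= 0xC2 && b <= 0xDF) return 2;  /* 110xxxxx */
    if (b >= 0xE0 && b <= 0xEF) return 3;  /* 1110xxxx */
    if (b >= 0xF0 && b <= 0xF4) return 4;  /* 11110xxx */
    return 0;                              /* continuation or invalid lead */
}

/* Hypothetical helper: count the characters (code points) in s and, if
   utf16_units is not NULL, the UTF-16 code units they would occupy.
   Returns the character count, or (size_t)-1 on malformed input. */
size_t utf8_count(const char *s, size_t *utf16_units)
{
    size_t chars = 0, units = 0;
    assert(s != NULL);
    while (*s != '\0') {
        size_t i, n = utf8_seq_len((unsigned char)*s);
        if (n == 0) return (size_t)-1;
        for (i = 1; i < n; i++) {
            /* every trailing octet must look like 10xxxxxx */
            if (((unsigned char)s[i] & 0xC0) != 0x80) return (size_t)-1;
        }
        chars += 1;
        units += (n == 4) ? 2 : 1;  /* 4-octet sequences need a surrogate pair */
        s += n;
    }
    if (utf16_units != NULL) *utf16_units = units;
    return chars;
}
```

For example, utf8_count("caf\xC3\xA9", &units) reports 4 characters for
a string that occupies 5 octets, with units set to 4.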

-Paul


Paul Deuter
Internationalization Manager
Plumtree Software
paul.deuter@plumtree.com 
 


-----Original Message-----
From: Richard, Francois M [mailto:Francois.M.Richard@usa.xerox.com]
Sent: Friday, September 28, 2001 5:39 AM
To: www-international@w3.org
Cc: 'Carl W. Brown'; Paul Deuter
Subject: RE: utf-8 Locale support on Solaris and Linux





> >
> > UTF-8 is not a locale.  UTF-8 is a multi-byte encoding of the
> > Unicode repertoire of characters.
> >
> > The behavior of the standard C functions depends on the
> > compiler and the system that you are using.

That is the information I am looking for. In particular for Linux and
GNU glibc 2.2...


> >
> > In order to get standard cross-platform support for
> > Unicode strings, I recommend using the ICU library.
> >
> > http://www-124.ibm.com/icu/
> 

OK. But assuming that cross-platform support is not a selection
criterion and that the OS we need to use is LYNX for now (ICU on
LYNX?), I am not sure if ICU is the answer. Also, the Unicode support
we are targeting would be restricted to a few character variables. So,
for instance, in a first stage, we are not interested in collation or
date and time formatting.

So, we are looking for minimal C code modifications to support Unicode
encoding for specific character variables.

Solaris (starting with 2.6) and Linux have this concept of CSI (Code
Set Independence). A Locale is a combination of language, country AND
character encoding form. So, let me try to restate my question:

In my C code, if I set the Locale to en_US.utf-8, do I need to change
my character datatype to wchar_t and call the wide character functions,
or is my original code (char and C functions) still valid?
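
To make the question concrete, here is a small sketch in C of the two
views of the same char buffer: strlen() counts octets regardless of
locale, while the standard multibyte functions count characters
according to the current LC_CTYPE locale. The helper names
use_utf8_locale and mb_char_count are hypothetical, and the locale names
tried are assumptions; which UTF-8 locales are installed varies from
system to system.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Try a few common spellings of a UTF-8 locale name. Which names exist is
   system dependent (an assumption here). Returns 1 on success, 0 otherwise. */
int use_utf8_locale(void)
{
    static const char *const names[] = { "en_US.UTF-8", "en_US.utf8", "C.UTF-8" };
    size_t i;
    for (i = 0; i < sizeof names / sizeof names[0]; i++) {
        if (setlocale(LC_CTYPE, names[i]) != NULL) return 1;
    }
    return 0;
}

/* Number of characters in a multibyte string under the current LC_CTYPE
   locale, or (size_t)-1 if the bytes are not valid in that encoding. */
size_t mb_char_count(const char *s)
{
    mbstate_t st;
    const char *p = s;
    assert(s != NULL);
    memset(&st, 0, sizeof st);           /* start in the initial shift state */
    return mbsrtowcs(NULL, &p, 0, &st);  /* NULL destination: count only */
}
```

With a UTF-8 locale active, strlen("caf\xC3\xA9") is still 5, whereas
mb_char_count("caf\xC3\xA9") is 4. Code that only stores, copies, or
compares whole strings keeps working on char; code that indexes or
classifies individual characters is where the difference shows up.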



> You are right to recommend ICU.  There are differences in how each
> Unix system deals with Unicode.  On Linux for example I can convert
> the UTF-8 text to Unicode wide characters with mbstowcs.  On Solaris
> the wide character implementation is not Unicode.

What is it then?
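
One way for a program to check this rather than assume it: C99 defines
the macro __STDC_ISO_10646__ when the implementation promises that
wchar_t values are ISO 10646 (Unicode) code points. The sketch below is
only an illustration; wchar_is_unicode and first_wchar are hypothetical
helper names, and it says nothing about what any particular Solaris
release actually stores in wchar_t.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* 1 if the implementation promises that wchar_t values are ISO 10646
   (Unicode) code points, 0 if it makes no such promise. */
int wchar_is_unicode(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* Convert the first character of a multibyte string to wchar_t using the
   current LC_CTYPE locale. Returns 0 on success, -1 on failure or if the
   string is empty. */
int first_wchar(const char *s, wchar_t *out)
{
    int n;
    assert(s != NULL && out != NULL);
    mbtowc(NULL, NULL, 0);            /* reset any shift state */
    n = mbtowc(out, s, MB_CUR_MAX);
    return (n > 0) ? 0 : -1;
}
```

If wchar_is_unicode() returns 1 and a UTF-8 locale is active, then
first_wchar("\xC3\xA9", &wc) leaves 0xE9 in wc. If it returns 0, the
value in wc is implementation defined and should not be compared against
Unicode code points.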

> 
> ICU provides a consistent cross platform implementation.  However you
> either have to convert to UTF-16 or add UTF-8 support to ICU.  xIUA
> http://www.xnetinc.com/xiua/ is open source code that adds full UTF-8
> support to ICU so that everything from xiua_strcoll to xiua_strtok
> works with UTF-8 strings.  If you don't want the rest of the code you
> can just use the UTF-8 support code.
> 
> >
> > -Paul
> >
> > Paul Deuter
> > Internationalization Manager
> > Plumtree Software
> > paul.deuter@plumtree.com
> >
> >
> >
> > -----Original Message-----
> > From: Richard, Francois M [mailto:Francois.M.Richard@usa.xerox.com]
> > Sent: Thursday, September 27, 2001 1:28 PM
> > To: 'www-international@w3.org'
> > Subject: utf-8 Locale support on Solaris and Linux
> >
> >
> > A basic question I guess...
> >
> > Do C functions like strlen(), isalpha() and other locale sensitive C
> > functions behave properly when Locale has been set to utf-8?
> 
> The standard Unix setlocale is not thread safe.  Using ICU there are
> no such restrictions.  If you also use xIUA with ICU then you can use
> the setlocale style of programming but be thread safe.  You can use
> POSIX locales with xiua_OpenLocale.  For example:
> 
> /* UTF-8 data with an underlying UTF-8 code page */
> xiua_OpenLocale("pt_BR.utf-8",XDFCODEPAGE);
>
> /* UTF-8 data with an underlying iso-8859-1 code page */
> xiua_OpenLocale("pt_BR.iso-8859-1",XDFUTF8);

So, the external character encoding is going to be assumed to be
iso-8859-1 (for instance, opening and reading a file will assume that
every character is encoded in iso-8859-1) and converted to UTF-8
internally (in memory). Is my interpretation right?
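
For reference, the conversion described in that interpretation is simple
to sketch by hand, because every ISO-8859-1 byte is the Unicode code
point with the same value (U+0000 to U+00FF). The function below is a
hypothetical helper and does not use xIUA or ICU; whether xIUA performs
the conversion at this point is exactly the open question above.

```c
#include <assert.h>
#include <stddef.h>

/* Convert a NUL-terminated ISO-8859-1 string to UTF-8. Each Latin-1 byte
   becomes one octet (below 0x80) or two octets (0x80 and above).
   Returns the number of octets written, excluding the terminator, or
   (size_t)-1 if dst is too small. */
size_t latin1_to_utf8(const char *src, char *dst, size_t dst_size)
{
    size_t n = 0;
    assert(src != NULL && dst != NULL);
    for (; *src != '\0'; src++) {
        unsigned char c = (unsigned char)*src;
        size_t need = (c < 0x80) ? 1 : 2;
        if (n + need + 1 > dst_size) return (size_t)-1;  /* keep room for NUL */
        if (c < 0x80) {
            dst[n++] = (char)c;
        } else {
            dst[n++] = (char)(0xC0 | (c >> 6));    /* 110000xx */
            dst[n++] = (char)(0x80 | (c & 0x3F));  /* 10xxxxxx */
        }
    }
    if (n + 1 > dst_size) return (size_t)-1;
    dst[n] = '\0';
    return n;
}
```

For example, the 4-byte Latin-1 string "caf\xE9" becomes the 5-octet
UTF-8 string "caf\xC3\xA9".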

> 
> For web applications xIUA also has some special functions.  For
> example if you want to determine what character set to use for a
> browser it provides a routine to analyze the Accept-Charset string and
> find the best character set for the specific locale.
> 
> Carl
>
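
The Accept-Charset analysis mentioned above can be sketched roughly as
follows. This is not the xIUA routine, whose interface is not shown in
this thread; best_charset is a hypothetical helper that simply picks the
entry with the highest q-value from the header value. It ignores the
locale, does not treat "*" specially, and resolves ties in favor of the
earlier entry.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy the charset with the highest q-value from an
   Accept-Charset header value into out. Entries with q=0 are skipped.
   Returns 1 if a charset was found, 0 otherwise. */
int best_charset(const char *header, char *out, size_t out_size)
{
    double best_q = 0.0;
    int found = 0;
    const char *p = header;
    assert(header != NULL && out != NULL && out_size > 0);
    while (*p != '\0') {
        const char *end = strchr(p, ',');
        const char *semi;
        size_t item_len, name_len;
        double q = 1.0;                  /* default when no q= is given */
        if (end == NULL) end = p + strlen(p);
        while (p < end && isspace((unsigned char)*p)) p++;
        item_len = (size_t)(end - p);
        semi = memchr(p, ';', item_len);
        name_len = (semi != NULL) ? (size_t)(semi - p) : item_len;
        while (name_len > 0 && isspace((unsigned char)p[name_len - 1])) name_len--;
        if (semi != NULL) {
            const char *qp = semi + 1;
            while (qp < end && isspace((unsigned char)*qp)) qp++;
            if (end - qp > 2 && (qp[0] == 'q' || qp[0] == 'Q') && qp[1] == '=')
                q = strtod(qp + 2, NULL);
        }
        if (name_len > 0 && name_len < out_size && q > best_q) {
            memcpy(out, p, name_len);
            out[name_len] = '\0';
            best_q = q;
            found = 1;
        }
        p = (*end == ',') ? end + 1 : end;  /* move to the next entry */
    }
    return found;
}
```

Given "iso-8859-1;q=0.5, utf-8;q=0.9, *;q=0.1", this selects "utf-8". A
real implementation would also weigh which character sets can represent
the text for the locale in question.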
Received on Friday, 28 September 2001 12:57:18 UTC