RE: utf-8 Locale support on Solaris and Linux from Richard, Francois M on 2001-09-28 (www-international@w3.org from July to September 2001)

From: Richard, Francois M <Francois.M.Richard@usa.xerox.com>
Date: Fri, 28 Sep 2001 08:39:02 -0400
To: www-international@w3.org
Cc: "'Carl W. Brown'" <cbrown@xnetinc.com>, "'Paul Deuter'" <Paul.Deuter@plumtree.com>
Message-id: <B08661D21F0FD311A21A00805FC7D65001EA34E5@usa0845ms1.svcdoc.mc.xerox.com>

> >
> > UTF-8 is not a locale.  UTF-8 is a multi-byte encoding of the
> > Unicode repertoire of characters.
> >
> > The behavior of the standard C functions depends on the
> > compiler and the system that you are using.

That is the information I am looking for. In particular for Linux and GNU
glibc 2.2...


> >
> > In order to get standard cross-platform support for
> > Unicode strings, I recommend using the ICU library.
> >
> > http://www-124.ibm.com/icu/
> 

OK. But assuming that cross-platform support is not a selection criterion and
that the OS we need to use is LYNX for now (ICU on LYNX?), I am not sure that
ICU is the answer. Also, the Unicode support we are targeting would be
restricted to a few character variables. So, for instance, in a first stage,
we are not interested in collation or in date and time formatting.

So, we are looking for minimal C code modifications to support Unicode
encoding for specific character variables.

Solaris (starting with 2.6) and Linux have this concept of CSI (Code Set
Independence). A Locale is a combination of language, country AND character
encoding form. So, let me try to restate my question:

In my C code, if I set the Locale to en_US.utf-8, do I need to change my
character datatype to wchar_t and call the wide character functions or is my
original code (char and C functions) still valid?
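
For reference, one part of this question does not depend on the platform: in
a UTF-8 locale a char string still holds bytes, so strlen() keeps returning a
byte count, not a character count. A minimal sketch in C, where
utf8_count_chars is a hypothetical helper (not part of the C library) and the
input is assumed to be well-formed UTF-8:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a standard C function.
 * Counts the code points in a well-formed UTF-8 string by counting every
 * byte that is NOT a continuation byte (continuation bytes look like
 * 10xxxxxx). No validation is performed. */
size_t utf8_count_chars(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

With the string "caf\xC3\xA9" ("café" encoded in UTF-8), strlen() reports the
number of bytes while utf8_count_chars() reports the number of characters, so
code that uses strlen() only for buffer sizes stays valid, but code that
treats its result as a character count does not.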



> You are right to recommend ICU.  There are differences in how each Unix
> system deals with Unicode.  On Linux for example I can convert the UTF-8
> text to Unicode wide characters with a mbstowcs.  On Solaris the wide
> character implementation is not Unicode.

What is it then?
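
The Linux conversion described above might be sketched as follows. This is
only an illustration under assumptions: utf8_to_wide is a hypothetical name,
the locale names tried are guesses that may not be installed on a given
system, and whether the resulting wchar_t values are Unicode code points is
implementation-defined (glibc defines __STDC_ISO_10646__ to say that they
are).

```c
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: converts a UTF-8 string to wide characters with
 * mbstowcs() after selecting a UTF-8 locale for LC_CTYPE.
 * Returns the number of wide characters written (excluding the
 * terminator), or -1 if no UTF-8 locale could be set or the input is not
 * valid in that locale. If the result equals dst_len, dst is NOT
 * null-terminated. */
long utf8_to_wide(const char *src, wchar_t *dst, size_t dst_len)
{
    /* Assumed locale names; adjust to what the system actually provides. */
    static const char *const candidates[] = {
        "en_US.UTF-8", "C.UTF-8", "en_US.utf8"
    };
    const char *selected = NULL;
    size_t i, n;

    for (i = 0; i < sizeof candidates / sizeof candidates[0]; ++i) {
        selected = setlocale(LC_CTYPE, candidates[i]);
        if (selected != NULL)
            break;
    }
    if (selected == NULL)
        return -1;

    n = mbstowcs(dst, src, dst_len);
    if (n == (size_t)-1)
        return -1;
    return (long)n;
}
```

Note that setlocale() changes process-wide state, so a real program would
normally call it once at startup rather than inside a helper like this.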

> 
> ICU provides a consistent cross platform implementation.  However you
> either have to convert to UTF-16 or add UTF-8 support to ICU.  xIUA
> http://www.xnetinc.com/xiua/ is open source code that adds full UTF-8
> support to ICU so that everything from xiua_strcoll to xiua_strtok works
> with UTF-8 strings.  If you don't want the rest of the code you can just
> use the UTF-8 support code.
> 
> >
> > -Paul
> >
> > Paul Deuter
> > Internationalization Manager
> > Plumtree Software
> > paul.deuter@plumtree.com
> >
> >
> >
> > -----Original Message-----
> > From: Richard, Francois M [mailto:Francois.M.Richard@usa.xerox.com]
> > Sent: Thursday, September 27, 2001 1:28 PM
> > To: 'www-international@w3.org'
> > Subject: utf-8 Locale support on Solaris and Linux
> >
> >
> > A basic question I guess...
> >
> > Do C functions like strlen(), isalpha() and other locale sensitive C
> > functions behave properly when Locale has been set to utf-8?
> 
> The standard Unix setlocale is not thread safe.  Using ICU there are no
> such restrictions.  If you also use xIUA with ICU then you can use the
> setlocale style of programming but be thread safe.  You can use POSIX
> locales with xiua_OpenLocale.  For example:
> 
> xiua_OpenLocale("pt_BR.utf-8",XDFCODEPAGE); /* UTF-8 data with an
> underlying UTF-8 code page */
> 
> xiua_OpenLocale("pt_BR.iso-8859-1",XDFUTF8); /* UTF-8 data with an
> underlying iso-8859-1 code page */

So, the external character encoding is going to be assumed to be iso-8859-1
(for instance, opening and reading a file will assume that every character is
encoded in iso-8859-1) and the data is converted to UTF-8 internally (in
memory). Is my interpretation right?
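
To make that interpretation concrete, here is a minimal sketch in C of the
kind of conversion it implies: iso-8859-1 bytes coming from outside are
re-encoded as UTF-8 in memory. latin1_to_utf8 is a hypothetical helper
written for illustration only; it is not an xIUA or ICU function and says
nothing about how xIUA actually implements this.

```c
#include <stddef.h>

/* Hypothetical helper: re-encodes a null-terminated ISO-8859-1 string as
 * UTF-8. Each byte below 0x80 is copied as-is; each byte from 0x80 to
 * 0xFF becomes a two-byte UTF-8 sequence.
 * Returns the number of bytes written (excluding the terminator), or
 * (size_t)-1 if dst is too small. */
size_t latin1_to_utf8(const char *src, char *dst, size_t dst_len)
{
    size_t out = 0;

    for (; *src != '\0'; ++src) {
        unsigned char c = (unsigned char)*src;

        if (c < 0x80) {
            /* Need room for this byte plus the terminator. */
            if (out + 1 >= dst_len)
                return (size_t)-1;
            dst[out++] = (char)c;
        } else {
            /* Need room for two bytes plus the terminator. */
            if (out + 2 >= dst_len)
                return (size_t)-1;
            dst[out++] = (char)(0xC0 | (c >> 6));
            dst[out++] = (char)(0x80 | (c & 0x3F));
        }
    }
    if (dst_len == 0)
        return (size_t)-1;
    dst[out] = '\0';
    return out;
}
```

For example, the iso-8859-1 byte 0xE9 (e with acute accent) becomes the two
UTF-8 bytes 0xC3 0xA9, so the in-memory string can be longer than the
external one.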

> 
> For web applications xIUA also has some special functions.  For example
> if you want to determine what character set to use for a browser it
> provides a routine to analyze the Accept-Charset string and find the best
> character set for the specific locale.
> 
> Carl
>

Received on Friday, 28 September 2001 08:39:28 UTC