- From: Richard, Francois M <Francois.M.Richard@usa.xerox.com>
- Date: Fri, 28 Sep 2001 08:39:02 -0400
- To: www-international@w3.org
- Cc: "'Carl W. Brown'" <cbrown@xnetinc.com>, "'Paul Deuter'" <Paul.Deuter@plumtree.com>
> >
> > UTF-8 is not a locale. UTF-8 is a multi-byte encoding of the
> > Unicode repertoire of characters.
> >
> > The behavior of the standard C functions depends on the
> > compiler and the system that you are using.
That is the information I am looking for. In particular for Linux and GNU
glibc 2.2...
> >
> > In order to get standard cross-platform support for
> > Unicode strings, I recommend using the ICU library.
> >
> > http://www-124.ibm.com/icu/
>
OK. But assuming that cross-platform support is not a selection criterion and
that the OS we need to use is LYNX for now (ICU on LYNX?), I am not sure that
ICU is the answer. Also, the Unicode support we are targeting would be
restricted to a few character variables. So, for instance, in a first stage,
we are not interested in collation or date and time formatting.
So, we are looking for minimal C code modifications to support Unicode
encoding for specific character variables.
Solaris (starting with 2.6) and Linux have this concept of CSI (Code Set
Independence). A locale is a combination of language, country AND character
encoding form. So, let me try to restate my question:
In my C code, if I set the Locale to en_US.utf-8, do I need to change my
character datatype to wchar_t and call the wide character functions or is my
original code (char and C functions) still valid?
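To make the question concrete, here is a minimal sketch (mine, not from the earlier replies) of the two ways of counting characters in UTF-8 data held in plain char strings. The helper names utf8_count and locale_char_count are hypothetical. The second helper assumes that the named locale is installed and that mbstowcs accepts a NULL destination to report a length, which is POSIX behavior rather than something guaranteed by ISO C.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>

/* Count code points in a UTF-8 string without any locale support by
 * skipping continuation bytes (bit pattern 10xxxxxx).
 * Assumes the input is valid UTF-8. */
size_t utf8_count(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

/* Count characters through the C library's multibyte support.
 * Returns (size_t)-1 if the locale cannot be set or if the string is
 * not valid in that locale's encoding. */
size_t locale_char_count(const char *s, const char *locale_name)
{
    assert(s != NULL && locale_name != NULL);
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return (size_t)-1;
    /* Assumption: a NULL destination makes mbstowcs return the number
     * of wide characters needed (POSIX behavior). */
    return mbstowcs(NULL, s, 0);
}
```

For the string "caf\xc3\xa9" (the word "café" in UTF-8), strlen() reports 5 because it counts bytes whatever the locale is, while utf8_count() reports 4. Byte-oriented functions such as isalpha() look at one char at a time, so they cannot classify a character that spans several bytes.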
> You are right to recommend ICU. There are differences in how each Unix
> system deals with Unicode. On Linux, for example, I can convert the UTF-8
> text to Unicode wide characters with mbstowcs. On Solaris the wide
> character implementation is not Unicode.
What is it then?
>
> ICU provides a consistent cross platform implementation. However you either
> have to convert to UTF-16 or add UTF-8 support to ICU. xIUA
> http://www.xnetinc.com/xiua/ is open source code that adds full UTF-8
> support to ICU so that everything from xiua_strcoll to xiua_strtok works
> with UTF-8 strings. If you don't want the rest of the code you can just use
> the UTF-8 support code.
>
> >
> > -Paul
> >
> > Paul Deuter
> > Internationalization Manager
> > Plumtree Software
> > paul.deuter@plumtree.com
> >
> >
> >
> > -----Original Message-----
> > From: Richard, Francois M [mailto:Francois.M.Richard@usa.xerox.com]
> > Sent: Thursday, September 27, 2001 1:28 PM
> > To: 'www-international@w3.org'
> > Subject: utf-8 Locale support on Solaris and Linux
> >
> >
> > A basic question I guess...
> >
> > Do C functions like strlen(), isalpha() and other locale sensitive C
> > functions behave properly when Locale has been set to utf-8?
>
> The standard Unix setlocale is not thread safe. Using ICU there are no such
> restrictions. If you also use xIUA with ICU then you can use the setlocale
> style of programming but be thread safe. You can use POSIX locales with
> xiua_OpenLocale. For example:
>
> xiua_OpenLocale("pt_BR.utf-8", XDFCODEPAGE); /* UTF-8 data with an
> underlying UTF-8 code page */
>
> xiua_OpenLocale("pt_BR.iso-8859-1", XDFUTF8); /* UTF-8 data with an
> underlying iso-8859-1 code page */
So, the external character encoding is going to be assumed to be iso-8859-1
(for instance, opening and reading a file will assume that every character is
encoded in iso-8859-1) and converted to UTF-8 internally (in memory). Is my
interpretation right?
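To show the kind of conversion I have in mind, here is a minimal sketch of turning iso-8859-1 bytes into UTF-8 in memory. It is only an illustration of the principle and makes no claim about how xIUA or ICU do it; the function name latin1_to_utf8 is hypothetical. It relies on the fact that iso-8859-1 byte values are the same as the first 256 Unicode code points.

```c
#include <assert.h>
#include <stddef.h>

/* Convert n bytes of ISO-8859-1 text to NUL-terminated UTF-8.
 * The caller must provide an output buffer of at least 2*n + 1 bytes,
 * since every byte >= 0x80 expands to two bytes.
 * Returns the number of bytes written, not counting the NUL. */
size_t latin1_to_utf8(const unsigned char *in, size_t n, unsigned char *out)
{
    assert(in != NULL && out != NULL);
    size_t j = 0;
    for (size_t i = 0; i < n; ++i) {
        unsigned char c = in[i];
        if (c < 0x80) {
            /* ASCII range: identical in both encodings. */
            out[j++] = c;
        } else {
            /* U+0080..U+00FF: two-byte sequence 110xxxxx 10xxxxxx. */
            out[j++] = (unsigned char)(0xC0 | (c >> 6));
            out[j++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    out[j] = '\0';
    return j;
}
```

For example, the 4 input bytes "caf\xe9" become the 5 output bytes "caf\xc3\xa9", so buffer sizes computed from the external data cannot be reused unchanged for the in-memory form.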
>
> For web applications xIUA also has some special functions. For example if
> you want to determine what character set to use for a browser it provides a
> routine to analyze the Accept-Charset string and find the best character set
> for the specific locale.
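As I understand the idea, analyzing Accept-Charset amounts to picking the supported charset with the highest q-value. The sketch below is my own rough illustration of that, not the xIUA routine: the names pick_charset, token_equals and parse_q are hypothetical, the "*" wildcard is ignored, and no locale-specific preference is applied.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive comparison of the n bytes at a with the string b. */
static int token_equals(const char *a, size_t n, const char *b)
{
    if (strlen(b) != n)
        return 0;
    for (size_t i = 0; i < n; ++i) {
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 0;
    }
    return 1;
}

/* Parse a q-value such as "0.8" by hand, so that the result does not
 * depend on the decimal point of the current LC_NUMERIC locale. */
static double parse_q(const char *s, const char *end)
{
    double q = 0.0, scale = 1.0;
    int seen_dot = 0;
    for (; s < end; ++s) {
        if (*s == '.' && !seen_dot) {
            seen_dot = 1;
        } else if (isdigit((unsigned char)*s)) {
            if (seen_dot) {
                scale /= 10.0;
                q += (*s - '0') * scale;
            } else {
                q = q * 10.0 + (*s - '0');
            }
        } else {
            break;
        }
    }
    return q;
}

/* Return the index in supported[] of the charset that has the highest
 * q-value in the header, or -1 if none is acceptable. On equal q-values
 * the charset listed first in the header wins. */
int pick_charset(const char *header, const char *const *supported, size_t count)
{
    assert(header != NULL && supported != NULL);
    int best = -1;
    double best_q = 0.0;
    const char *p = header;
    while (*p != '\0') {
        const char *end = p + strcspn(p, ",");   /* one list element */
        while (p < end && isspace((unsigned char)*p))
            ++p;
        const char *name = p;
        while (p < end && *p != ';' && !isspace((unsigned char)*p))
            ++p;
        size_t name_len = (size_t)(p - name);

        double q = 1.0;                          /* default without a q parameter */
        const char *semi = memchr(p, ';', (size_t)(end - p));
        if (semi != NULL) {
            const char *s = semi + 1;
            while (s < end && isspace((unsigned char)*s))
                ++s;
            if (end - s >= 2 && (s[0] == 'q' || s[0] == 'Q') && s[1] == '=')
                q = parse_q(s + 2, end);
        }
        for (size_t i = 0; i < count; ++i) {
            if (q > best_q && token_equals(name, name_len, supported[i])) {
                best = (int)i;
                best_q = q;
            }
        }
        p = (*end == ',') ? end + 1 : end;
    }
    return best;
}
```

With supported charsets {"iso-8859-1", "utf-8"}, the header "iso-8859-1;q=0.5, UTF-8;q=0.8" selects index 1 (utf-8), and a header naming only unsupported charsets yields -1.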
>
> Carl
>
Received on Friday, 28 September 2001 08:39:28 UTC