- From: Richard, Francois M <Francois.M.Richard@usa.xerox.com>
- Date: Fri, 28 Sep 2001 08:39:02 -0400
- To: www-international@w3.org
- Cc: "'Carl W. Brown'" <cbrown@xnetinc.com>, "'Paul Deuter'" <Paul.Deuter@plumtree.com>
> > UTF-8 is not a locale. UTF-8 is a multi-byte encoding of the
> > Unicode repertoire of characters.
> >
> > The behavior of the standard C functions depends on the
> > compiler and the system that you are using.

That is the information I am looking for. In particular for Linux and GNU glibc 2.2...

> > In order to get standard cross-platform support for
> > Unicode strings, I recommend using the ICU library.
> >
> > http://www-124.ibm.com/icu/

OK. But assuming that cross-platform is not a selection criterion and that the OS we need to use is LYNX for now (ICU on LYNX?), I am not sure if ICU is the answer.
Also, the Unicode support we are targeting would be restricted to a few character variables. So, for instance, in a first stage, we are not interested in collation or in date and time formatting. So, we are looking for minimal C code modifications to support Unicode encoding for specific character variables.

Solaris (starting with 2.6) and Linux have this concept of CSI. A Locale is a combination of language, country AND character encoding form.
So, let me try to restate my question:
In my C code, if I set the Locale to en_US.utf-8, do I need to change my character datatype to wchar_t and call the wide character functions, or is my original code (char and C functions) still valid?

> You are right to recommend ICU. There are differences in how each Unix
> system deals with Unicode. On Linux for example I can convert the UTF-8
> text to Unicode wide characters with a mbstowcs. On Solaris the wide
> character implementation is not Unicode.

What is it then?

> ICU provides a consistent cross platform implementation. However you either
> have to convert to UTF-16 or add UTF-8 support to ICU. xIUA
> http://www.xnetinc.com/xiua/ is open source code that adds full UTF-8
> support to ICU so that everything from xiua_strcoll to xiua_strtok works
> with UTF-8 strings. If you don't want the rest of the code you can just use
> the UTF-8 support code.
>
> > -Paul
> >
> > Paul Deuter
> > Internationalization Manager
> > Plumtree Software
> > paul.deuter@plumtree.com
> >
> > -----Original Message-----
> > From: Richard, Francois M [mailto:Francois.M.Richard@usa.xerox.com]
> > Sent: Thursday, September 27, 2001 1:28 PM
> > To: 'www-international@w3.org'
> > Subject: utf-8 Locale support on Solaris and Linux
> >
> > A basic question I guess...
> >
> > Do C functions like strlen(), isalpha() and other locale sensitive C
> > functions behave properly when Locale has been set to utf-8?
>
> The standard Unix setlocale is not thread safe. Using ICU there are no such
> restrictions. If you also use xIUA with ICU then you can use the setlocale
> style of programming but be thread safe. You can use POSIX locales with
> xiua_OpenLocale. For example:
>
> xiua_OpenLocale("pt_BR.utf-8",XDFCODEPAGE); /* UTF-8 data with an underlying UTF-8 code page */
>
> xiua_OpenLocale("pt_BR.iso-8859-1",XDFUTF8); /* UTF-8 data with an underlying iso-8859-1 code page */

So, the external character encoding is going to be assumed to be iso-8859-1 (for instance, opening and reading a file will assume that every character is encoded in iso-8859-1) and converted to UTF-8 internally (in memory). Is my interpretation right?

> For web applications xIUA also has some special functions. For example if
> you want to determine what character set to use for a browser it provides a
> routine to analyze the Accept-Charset string and find the best character set
> for the specific locale.
>
> Carl
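
For reference, here is a minimal sketch of the byte-versus-character distinction behind the question above. It is an illustration, not an answer from any poster in the thread. The helper names use_utf8_locale and char_count are hypothetical, and the locale names tried are assumptions, since the names actually installed differ from system to system. strlen() counts bytes whatever the locale is, while mbstowcs() decodes the string according to the LC_CTYPE category of the current locale.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Try a few commonly used names for a UTF-8 locale (assumed names; the
   set actually installed varies by system). Returns 1 on success. */
static int use_utf8_locale(void)
{
    static const char *const names[] = { "en_US.UTF-8", "en_US.utf8", "C.UTF-8" };
    size_t i;

    for (i = 0; i < sizeof names / sizeof names[0]; i++) {
        if (setlocale(LC_CTYPE, names[i]) != NULL)
            return 1;
    }
    return 0;
}

/* Count the characters in a multibyte string under the current locale.
   Returns (size_t)-1 if the string is not valid in that locale or if
   memory cannot be allocated. */
static size_t char_count(const char *s)
{
    size_t cap = strlen(s) + 1;   /* every character takes at least one byte */
    wchar_t *buf = malloc(cap * sizeof *buf);
    size_t n;

    if (buf == NULL)
        return (size_t)-1;
    n = mbstowcs(buf, s, cap);    /* decodes according to LC_CTYPE */
    free(buf);
    return n;
}
```

With the UTF-8 bytes of "café" ("caf\xc3\xa9"), strlen() reports 5 because it counts bytes. After use_utf8_locale() succeeds, char_count() reports 4. Code that only copies, concatenates or compares char strings byte by byte keeps working on UTF-8 data, whereas code that needs per-character answers has to decode first, for example this way.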
Received on Friday, 28 September 2001 08:39:28 UTC