- From: Paul Deuter <Paul.Deuter@plumtree.com>
- Date: Fri, 28 Sep 2001 09:57:07 -0700
- To: "Richard, Francois M" <Francois.M.Richard@usa.xerox.com>, <www-international@w3.org>
- Cc: "Carl W. Brown" <cbrown@xnetinc.com>
No one can make tradeoff judgments for you, so I won't even try. However, there are some facts which you should know:

1. ICU is C/C++ open source code and therefore should work on any system.

2. UTF-8 is an MBCS where each character can be composed of 1-4 octets. Therefore you do not use wide characters with UTF-8. (Note: if you use UTF-8, you should learn it. It only takes a few minutes to understand the encoding - it is very simple and quite beautiful too. There are lots of references on the web.)

3. Most ICU interfaces do not take UTF-8 strings but rather UTF-16 strings (whose code units are 16 bits wide).

4. Internationalization engineers spend their lives retrofitting old code and wish that more concern for this effort had been shown during initial design. If you are planning on migrating your software to other platforms such as Solaris (as you mention), then using a cross-platform approach (such as ICU) could give long-term benefits in addition to the short-term benefit of knowing that your Unicode strings are being processed properly.

-Paul

Paul Deuter
Internationalization Manager
Plumtree Software
paul.deuter@plumtree.com

-----Original Message-----
From: Richard, Francois M [mailto:Francois.M.Richard@usa.xerox.com]
Sent: Friday, September 28, 2001 5:39 AM
To: www-international@w3.org
Cc: 'Carl W. Brown'; Paul Deuter
Subject: RE: utf-8 Locale support on Solaris and Linux

> >
> > UTF-8 is not a locale. UTF-8 is a multi-byte encoding of the
> > Unicode repertoire of characters.
> >
> > The behavior of the standard C functions depends on the
> > compiler and the system that you are using.

That is the information I am looking for. In particular for Linux and GNU glibc 2.2...

> >
> > In order to get standard cross-platform support for
> > Unicode strings, I recommend using the ICU library.
> >
> > http://www-124.ibm.com/icu/
>

OK.
But assuming that cross-platform support is not a selection criterion and that the OS we need to use is LYNX for now (ICU on LYNX?), I am not sure if ICU is the answer. Also, the Unicode support we are targeting would be restricted to a few character variables. So, for instance, in a first stage, we are not interested in collation or date and time formatting. So, we are looking at minimal C code modifications to support Unicode encoding for specific character variables.

Solaris (starting with 2.6) and Linux have this concept of CSI. A Locale is a combination of language, country AND character encoding form. So, let me try to restate my question: In my C code, if I set the Locale to en_US.utf-8, do I need to change my character datatype to wchar_t and call the wide character functions, or is my original code (char and C functions) still valid?

> You are right to recommend ICU. There are differences in how
> each Unix
> system deals with Unicode. On Linux for example I can
> convert the UTF-8
> text to Unicode wide characters with a mbstowcs. On Solaris the wide
> character implementation is not Unicode.

What is it then?

>
> ICU provides a consistent cross platform implementation.
> However you either
> have to convert to UTF-16 or add UTF-8 support to ICU. xIUA
> http://www.xnetinc.com/xiua/ is open source code that adds full UTF-8
> support to ICU so that everything from xiua_strcoll to
> xiua_strtok works
> with UTF-8 strings. If you don't want the rest of the code
> you can just use
> the UTF-8 support code.
>
> >
> > -Paul
> >
> > Paul Deuter
> > Internationalization Manager
> > Plumtree Software
> > paul.deuter@plumtree.com
> >
> >
> >
> > -----Original Message-----
> > From: Richard, Francois M [mailto:Francois.M.Richard@usa.xerox.com]
> > Sent: Thursday, September 27, 2001 1:28 PM
> > To: 'www-international@w3.org'
> > Subject: utf-8 Locale support on Solaris and Linux
> >
> >
> > A basic question I guess...
> >
> > Do C functions like strlen(), isalpha() and other locale sensitive C
> > functions behave properly when Locale has been set to utf-8?
>
> The standard Unix setlocale is not thread safe. Using ICU
> there are no such
> restrictions. If you also use xIUA with ICU then you can use
> the setlocale
> style of programming but be thread safe. You can use POSIX
> locales with
> xiua_OpenLocale. For example:
>
> xiua_OpenLocale("pt_BR.utf-8",XDFCODEPAGE); /* UTF-8 data
> with an underlying
> UTF-8 code page*/
>
> xiua_OpenLocale("pt_BR.iso-8859-1",XDFUTF8); /* UTF-8 data with an
> underlying iso-8859-1 code page*/

So, the external character encoding is going to be assumed to be iso-8859-1 (for instance, opening and reading a file will assume that every character is encoded in iso-8859-1) and converted to UTF-8 internally (in memory). Is my interpretation right?

>
> For web applications xIUA also has some special functions.
> For example if
> you want to determine what character set to use for a browser
> it provides a
> routine to analyze the Accept-Charset string and find the
> best character set
> for the specific locale.
>
> Carl
>
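
To make the strlen() question in this thread concrete: because a UTF-8 character occupies 1-4 octets, strlen() on a char buffer reports octets, not characters. The following is only a minimal sketch in plain C. The helper names utf8_seq_len and utf8_count are hypothetical (they are not part of ICU, xIUA, or the C library), and the code assumes the input is already well-formed UTF-8; it does no validation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of octets in the UTF-8 sequence that
   starts with lead byte b, or 0 if b cannot start a sequence. */
static size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;            /* 0xxxxxxx: ASCII */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;                          /* continuation or invalid lead byte */
}

/* Hypothetical helper: count characters (code points) in a
   NUL-terminated string that is assumed to be well-formed UTF-8.
   Every octet that is not a continuation byte (10xxxxxx) starts
   a new character, so only those octets are counted. */
static size_t utf8_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For the five-octet string "caf\xC3\xA9" (the UTF-8 encoding of "café"), strlen() returns 5 while utf8_count() returns 4. Byte-oriented functions such as strlen() and strcpy() therefore stay safe for buffer sizing and copying, but any code that treats one char as one character needs attention.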
Received on Friday, 28 September 2001 12:57:18 UTC