RE: utf-8 Locale support on Solaris and Linux from Carl W. Brown on 2001-09-28 (www-international@w3.org from July to September 2001)

From: Carl W. Brown <cbrown@xnetinc.com>
Date: Fri, 28 Sep 2001 07:45:46 -0700
To: <www-international@w3.org>
Message-ID: <FNEHIHOMIIDPDCIFEJEGKEMLCJAA.cbrown@xnetinc.com>
Francois,
> > >
> > > UTF-8 is not a locale.  UTF-8 is a multi-byte encoding of the
> > > Unicode repertoire of characters.
> > >
> > > The behavior of the standard C functions depends on the
> > > compiler and the system that you are using.
>
> That is the information I am looking for. In particular for Linux and GNU
> glibc 2.2...
>

You can start with:
http://www-106.ibm.com/developerworks/linux/library/l-linuni.html

>
> > >
> > > In order to get standard cross-platform support for
> > > Unicode strings, I recommend using the ICU library.
> > >
> > > http://www-124.ibm.com/icu/
> >
>
> OK. But assuming that cross-platform is not a criterion of
> selection and that
> the OS we need to use is LYNX for now (ICU on LYNX?), I am not sure if ICU
> is the answer. Also, the Unicode support we are targeting would be
> restricted to a few character variables. So, for instance, in a
> first stage,
> we are not interested in collation, date, time formatting.

Collation and date/time formatting are the locale-dependent functions.

>
> So, we are looking at a minimum C code modifications for
> supporting Unicode
> encoding for specific character variables.
>
> Solaris (starting with 2.6) and Linux have this concept of CSI. A
> Locale is
> a combination of language, country AND character encoding form. So, let me
> try to restate my question:
>
> In my C code, if I set the Locale to en_US.utf-8, do I need to change my
> character datatype to wchar_t and call the wide character
> functions or is my
> original code (char and C functions) still valid?
>

If you are working with UTF-8 and only need strlen, strchr, strstr, strcpy,
strncpy, strcat, strcmp, strpbrk, strspn, strcspn, strtok, strtok_r etc., you
are merely manipulating UTF-8 text with a knowledge of the UTF-8 encoding
format but without any regard to the actual meaning of any UTF-8 characters.
For that you don't really need Unicode or locale support.
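
As a minimal sketch of what that byte-level handling looks like (the
function names and the sample string are made up for illustration), note
that strlen reports bytes rather than characters, while strstr still finds
a multi-byte substring correctly.  That works because UTF-8 lead bytes and
continuation bytes use disjoint ranges, so a valid needle can only match on
a character boundary:

```c
#include <stddef.h>
#include <string.h>

/* "naïve café" in UTF-8; the hex escapes spell out the two-byte
   sequences for ï (C3 AF) and é (C3 A9). */
static const char sample[] = "na\xC3\xAFve caf\xC3\xA9";

/* strlen counts bytes: the 10 characters above occupy 12 bytes. */
size_t sample_byte_length(void)
{
    return strlen(sample);
}

/* strstr compares bytes, which is enough to locate a UTF-8 substring.
   Returns the byte offset of the first match, or -1 if there is none. */
long find_offset(const char *haystack, const char *needle)
{
    const char *hit = strstr(haystack, needle);
    return hit ? (long)(hit - haystack) : -1L;
}
```

Here find_offset(sample, "caf\xC3\xA9") yields a byte offset, not a
character index, which is fine as long as you only use it to address the
same byte string.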

You mentioned that you are doing this first, implying that later you will
want more Unicode support.  It is best to figure out where you want to end
up to avoid redoing work or ending up with solutions that are difficult to
migrate later.

If this is what you want, you can build a UTF-8 support library that will
work on both LYNX and Solaris.  This will give you the functions that you
need now, and you can add support for more Unicode functionality later
without changing any code you have already written.

Probably the easiest way is to use xIUA; it contains the code for UTF-8
string handling.  Look for the xiu8_ routines.  You can use the code from
xIUA in your own function library.  You will notice some routines like
xiu8_strncpyEx.  This is probably more useful than strncpy because it
copies data to the output buffer, ensuring that it only copies complete
UTF-8 characters and always adds a null.  If you use strncpy and add a null
at the end, you may not have a valid string because it may end with a
partial character.  xiu8_strchrEx is another extended function.  With the
standard strchr the character you search for is passed as an int and
converted to a single char, so it cannot represent a multi-byte UTF-8
character.  Instead xiu8_strchrEx takes a pointer to the character in
storage that you are looking for.
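
To illustrate the truncation problem, here is a rough sketch of a copy
routine that cuts only on a character boundary.  The name u8_copy_trunc
and its signature are hypothetical; this is not the actual xiu8_strncpyEx
interface, and it assumes the source is already valid UTF-8:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of xIUA.
   Copies src into dst, whose capacity is dst_size bytes including the
   terminator.  If src does not fit, the copy is shortened so that it
   ends on a UTF-8 character boundary.  The result is always
   null-terminated when dst_size > 0.  Returns the number of bytes
   copied, excluding the terminator. */
size_t u8_copy_trunc(char *dst, size_t dst_size, const char *src)
{
    size_t n;

    if (dst_size == 0)
        return 0;

    n = strlen(src);
    if (n > dst_size - 1) {
        n = dst_size - 1;
        /* src[n] is the first byte that will NOT be copied.  If it is a
           continuation byte (10xxxxxx) the cut would split a character,
           so back up until the cut falls in front of a lead byte. */
        while (n > 0 && ((unsigned char)src[n] & 0xC0) == 0x80)
            n--;
    }

    memcpy(dst, src, n);
    dst[n] = '\0';
    return n;
}
```

With a 5-byte buffer and the 5-byte string "caf\xC3\xA9" (café), a plain
strncpy plus a forced null would leave "caf" followed by a stray 0xC3; the
sketch above copies just "caf".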

xIUA also contains UTF-8 validation routines, a routine to count the
characters in a string, a routine to test the length of a character, and
routines to move one logical character forward or backward in a string.
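
The helpers below are a sketch of how such routines can be written from
the encoding rules alone.  All four names are hypothetical (they are not
the xIUA function names), the code assumes well-formed input, and it uses
the modern limit of four bytes per character:

```c
#include <stddef.h>

/* Byte length of the character whose lead byte is at p.
   Returns 0 if p points at a continuation byte or an invalid lead byte. */
size_t u8_char_len(const char *p)
{
    unsigned char c = (unsigned char)*p;

    if (c < 0x80)           return 1;   /* 0xxxxxxx: ASCII          */
    if ((c & 0xE0) == 0xC0) return 2;   /* 110xxxxx: 2-byte sequence */
    if ((c & 0xF0) == 0xE0) return 3;   /* 1110xxxx: 3-byte sequence */
    if ((c & 0xF8) == 0xF0) return 4;   /* 11110xxx: 4-byte sequence */
    return 0;
}

/* Number of characters in s: every byte that is not a continuation
   byte (10xxxxxx) starts a new character. */
size_t u8_count(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    return count;
}

/* Pointer to the start of the next character, or p itself at the end. */
const char *u8_next(const char *p)
{
    if (*p == '\0')
        return p;
    p++;
    while (((unsigned char)*p & 0xC0) == 0x80)
        p++;
    return p;
}

/* Pointer to the start of the previous character, never before start. */
const char *u8_prev(const char *start, const char *p)
{
    if (p > start) {
        p--;
        while (p > start && ((unsigned char)*p & 0xC0) == 0x80)
            p--;
    }
    return p;
}
```

Stepping backward works without any extra state because continuation bytes
are recognizable on their own, which is one of the properties that makes
UTF-8 convenient for this kind of byte-oriented code.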

You can use a static variable for the xiu8_strtok working pointer.  This way
it will work like any other strtok.  The full xIUA implementation uses
thread local storage, which makes this function thread safe.  You can use
xiua_strtok_r for thread-safe or concurrent tokenizing.
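
To show why a caller-supplied working pointer matters, here is a small
sketch using the POSIX strtok_r directly (not the xIUA routines) with two
tokenizers active at once.  The function name count_fields is made up for
the example.  ASCII delimiters are safe with UTF-8 text because an ASCII
byte never occurs inside a multi-byte sequence:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* ask for the strtok_r declaration */
#endif
#include <stddef.h>
#include <string.h>

/* Counts the comma-separated fields across all ';'-separated records.
   The text is modified in place, as with any strtok-style tokenizer.
   Each loop keeps its own state pointer, so the inner tokenizer does
   not disturb the outer one; a single shared static pointer, as in
   plain strtok, would lose the outer position. */
size_t count_fields(char *text)
{
    char *outer_state = NULL;
    char *inner_state = NULL;
    char *record;
    char *field;
    size_t fields = 0;

    for (record = strtok_r(text, ";", &outer_state);
         record != NULL;
         record = strtok_r(NULL, ";", &outer_state)) {
        for (field = strtok_r(record, ",", &inner_state);
             field != NULL;
             field = strtok_r(NULL, ",", &inner_state)) {
            fields++;
        }
    }
    return fields;
}
```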

Later, if you want functions that actually deal with the characters, you can
add ICU support.  If you need strcasecmp, strcoll, strftime etc., then you can
add the rest of the xIUA support.  The xIUA strftime support is a two-step
process.  It uses a special function that uses the ICU locale specific
configuration data to convert a strftime format to an ICU format string.
You can then call the formatting or parsing services.

The xIUA locale is like the POSIX locale in that it has language, country,
character set and variant, but it also adds time zone to the locale.  It also
allows each thread to have several locales open at the same time.  I can have
a locale for a browser that is "ja_JP.Shift_JIS", an HTML file locale of
"ja_JP.EUC-JP", a SQL locale using UTF-8, a database data locale using
UTF-16 and a system services wide character locale using UTF-32.

Carl
Received on Friday, 28 September 2001 10:45:52 UTC