RE: utf-8 Locale support on Solaris and Linux from Carl W. Brown on 2001-09-28 (www-international@w3.org from July to September 2001)

From: Carl W. Brown <cbrown@xnetinc.com>
Date: Fri, 28 Sep 2001 08:33:10 -0700
To: <www-international@w3.org>
Message-ID: <FNEHIHOMIIDPDCIFEJEGGEMMCJAA.cbrown@xnetinc.com>
Francois,

I forgot to mention that you can take advantage of the fact that UTF-8 is
different from other MBCS character sets.  The major difference is that
there is no overlap between lead and trailing bytes.

You will notice that functions like strlen will work on UTF-8 data, although
strlen returns a count of bytes, not characters.  Some of the xIUA functions
are comparable to the standard ones: xiu8_strstr is essentially just like
strstr.  In this case you could use strstr on UTF-8 data because the lead and
trailing bytes are different, so a match can never start in the middle of a
character.  xiu8_strstr runs a bit faster and returns error codes.  strstr
compares every byte against every byte, so it does some extra scanning, but
that does no harm.  You could even use strstr as a UTF-8 strchr if the
character you are searching for is stored as a null-terminated string.
Again it will be slower but it will work.
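
To illustrate, here is a minimal sketch (plain C, not xIUA code) of using
strstr directly on UTF-8 data.  The helper names utf8_offset_of and
utf8_find_char are made up for this example.  It assumes that both the text
and the search string are valid, null-terminated UTF-8.  A valid search
string always begins with an ASCII byte or a lead byte, and those never occur
as trailing bytes, so a match can only begin at a character boundary.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: byte offset of needle in haystack, or -1 if the
 * needle does not occur.  Both arguments are assumed to be valid UTF-8. */
static long utf8_offset_of(const char *haystack, const char *needle)
{
    const char *p = strstr(haystack, needle);
    return p ? (long)(p - haystack) : -1L;
}

/* Hypothetical helper: a strchr substitute for UTF-8.  The character to
 * look for is passed as a null-terminated string (1 to 4 bytes plus the
 * terminator) instead of an int, so plain strstr can do the search. */
static const char *utf8_find_char(const char *s, const char *ch)
{
    return strstr(s, ch);
}
```

With the 12-byte string "na\xC3\xAFve caf\xC3\xA9" (ten characters),
searching for "\xC3\xA9" gives byte offset 10.  The result is a byte
offset, not a character index; the character index would be 9.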

Carl



> -----Original Message-----
> From: www-international-request@w3.org
> [mailto:www-international-request@w3.org]On Behalf Of Carl W. Brown
> Sent: Friday, September 28, 2001 7:46 AM
> To: www-international@w3.org
> Subject: RE: utf-8 Locale support on Solaris and Linux
>
>
> Francois,
> > > >
> > > > UTF-8 is not a locale.  UTF-8 is a multi-byte encoding of the
> > > > > Unicode repertoire of characters.
> > > >
> > > > The behavior of the standard C functions depends on the
> > > > compiler and the system that you are using.
> >
> > That is the information I am looking for. In particular for Linux and
> > GNU glibc 2.2...
> >
>
> You can start with:
> http://www-106.ibm.com/developerworks/linux/library/l-linuni.html
>
> >
> > > >
> > > > In order to get standard cross-platform support for
> > > > Unicode strings, I recommend using the ICU library.
> > > >
> > > > http://www-124.ibm.com/icu/
> > >
> >
> > OK. But assuming that cross-platform is not a criterion of selection and
> > that the OS we need to use is LYNX for now (ICU on LYNX?), I am not sure
> > if ICU is the answer. Also, the Unicode support we are targeting would be
> > restricted to a few character variables. So, for instance, in a first
> > stage, we are not interested in collation, date, time formatting.
>
> Collation and date and time formatting are the locale-dependent functions.
>
> >
> > So, we are looking at minimal C code modifications for supporting Unicode
> > encoding for specific character variables.
> >
> > Solaris (starting with 2.6) and Linux have this concept of CSI. A Locale
> > is a combination of language, country AND character encoding form. So,
> > let me try to restate my question:
> >
> > In my C code, if I set the Locale to en_US.utf-8, do I need to change my
> > character datatype to wchar_t and call the wide character functions or is
> > my original code (char and C functions) still valid?
> >
>
> If you are working with UTF-8 and only need strlen, strchr, strstr, strcpy,
> strncpy, strcat, strcmp, strpbrk, strspn, strcspn, strtok, strtok_r etc. you
> are merely manipulating UTF-8 text with a knowledge of the UTF-8 encoding
> format but without any regard to the actual meaning of any UTF-8 characters.
> For that you don't really need Unicode or locale support.
>
> You mentioned that you are doing this first implying that later you will
> want more Unicode support.  It is best to figure out where you want to end
> up to avoid redoing work or ending up with solutions that are difficult to
> migrate later.
>
> If this is what you want, you can build a UTF-8 support library that will
> work on both LYNX and Solaris.  This will give you the functions that you
> need now, and later you can add support for Unicode functionality
> without changing any code you have already written.
>
> Probably the easiest way is to use xIUA; it contains the code for UTF-8
> string handling.  Look for the xiu8_ routines.  You can use the code from
> xIUA for your function library code.  You will notice some routines like
> xiu8_strncpyEx.  This is probably more useful than strncpy because it will
> copy data to the output buffer, ensuring that it only copies complete UTF-8
> characters and always adds a null.  If you use strncpy and add a null at the
> end you may not have a valid string because it may end with a partial
> character.  xiu8_strchrEx is another extended function.  With the standard
> strchr your character is an int.  This does not work for UTF-8 characters
> longer than one byte.  Instead xiu8_strchrEx uses a pointer to the character
> in storage that you are looking for.
>
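
The partial-character problem described in the quoted paragraph above can be
sketched as follows.  This is not the xIUA source and it does not reproduce
the xiu8_strncpyEx signature; utf8_copy_trunc is a hypothetical name, and
the sketch assumes that src is valid, null-terminated UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy src into dst, which holds dstsize bytes
 * including the terminator.  If src does not fit, cut it at a character
 * boundary so dst never ends with a partial UTF-8 character.  Always
 * null-terminates when dstsize > 0.  Returns the number of bytes copied,
 * not counting the terminator. */
static size_t utf8_copy_trunc(char *dst, size_t dstsize, const char *src)
{
    size_t n;

    if (dstsize == 0)
        return 0;

    n = strlen(src);
    if (n > dstsize - 1) {
        n = dstsize - 1;
        /* src[n] is the first byte that will NOT be copied.  If it is a
         * trailing byte (10xxxxxx), the cut falls inside a character, so
         * back up until src[n] is the start of a character. */
        while (n > 0 && ((unsigned char)src[n] & 0xC0) == 0x80)
            n--;
    }
    memcpy(dst, src, n);
    dst[n] = '\0';
    return n;
}
```

Copying "ab\xC3\xA9" into a 4-byte buffer with this sketch yields "ab".
strncpy with a limit of 3 followed by a manual null would instead leave
"ab\xC3", which ends in a dangling lead byte.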
> xIUA also contains UTF-8 validation routines, a routine to count the
> characters in a string, a routine to test the length of a character, and
> routines to move one logical character forward and backward in a string.
>
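
As a rough idea of what such routines involve, here is a minimal sketch in
plain C.  It is not taken from xIUA; all of the utf8_ names are hypothetical,
and apart from the lead-byte check in utf8_char_len the code assumes the
input is already valid UTF-8.

```c
#include <stddef.h>

/* True if b is a UTF-8 trailing byte (bit pattern 10xxxxxx). */
static int utf8_is_trail(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Length in bytes of a character whose first byte is b, or 0 if b cannot
 * start a character (a trailing byte, or 0xC0, 0xC1, 0xF5..0xFF). */
static size_t utf8_char_len(unsigned char b)
{
    if (b < 0x80)              return 1;
    if (b >= 0xC2 && b <= 0xDF) return 2;
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;
    return 0;
}

/* Count characters by counting every byte that is not a trailing byte. */
static size_t utf8_count(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if (!utf8_is_trail((unsigned char)*s))
            n++;
    return n;
}

/* Move forward one character; stays put at the terminating null. */
static const char *utf8_next(const char *p)
{
    if (*p)
        p++;
    while (utf8_is_trail((unsigned char)*p))
        p++;
    return p;
}

/* Move backward one character, never going before start. */
static const char *utf8_prev(const char *start, const char *p)
{
    if (p > start)
        p--;
    while (p > start && utf8_is_trail((unsigned char)*p))
        p--;
    return p;
}
```

Stepping backward works only because trailing bytes can be recognized on
their own, which is the same property that makes strstr safe on UTF-8.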
> You can use a static variable for the xiu8_strtok working pointer.  This way
> it will work like any other strtok.  The full xIUA implementation uses
> thread local storage so it makes this function thread safe.  You can use
> xiua_strtok_r for thread safe or concurrent tokenizing.
>
> Later, if you want functions that actually deal with the characters, you can
> add ICU support.  If you need strcasecmp, strcoll, strftime etc. then you can
> add the rest of the xIUA support.  The xIUA strftime support is a two-step
> process.  It uses a special function that uses the ICU locale-specific
> configuration data to convert a strftime format to an ICU format string.
> You can then call the formatting or parsing services.
>
> The xIUA locale is like the POSIX locale in that it has language, country,
> character set, variant but it also adds time zone to the locale.  It also
> allows each thread to have several locales open at the same time.  I can have
> a locale for a browser that is "ja_JP.Shift_JIS", an HTML file locale of
> "ja_JP.EUC-JP", a SQL locale using UTF-8, a database data locale using
> UTF-16 and a system services wide character locale using UTF-32.
>
> Carl
>
>
Received on Friday, 28 September 2001 11:33:22 UTC