Re: Unicode support for C/C++

From: Thierry Sourbier <webmaster@i18ngurus.com>
Date: Mon, 20 Aug 2001 09:24:07 +0200
Message-ID: <004a01c12949$19df4aa0$b1f4fea9@dell400>
To: <www-international@w3.org>

> C can also handle Unicode by using UTF-8 as the
> multi-byte encoding in the char * type.

That is somewhat true. UTF-8 presents some interesting characteristics for C:

* It preserves the ASCII characters (all characters <128 remain as-is).
* UTF-8 encoded strings do not contain NUL (zero) bytes, so they still work
as null-terminated strings.

Therefore, if your program relies on recognizing some ASCII sequences AND
does not modify bytes with a value of 128 or above (i.e. it is 8-bit clean),
then your program may work with UTF-8 just fine.
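
To illustrate the point above, here is a minimal sketch of an ASCII search
over a UTF-8 string. The helper name find_ascii is hypothetical, and the
input is assumed to be valid, null-terminated UTF-8. It relies only on the
fact that every byte of a multi-byte UTF-8 sequence is 0x80 or above, so an
ASCII byte can never match in the middle of a non-ASCII character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the byte offset of the first occurrence of
 * an ASCII delimiter in a null-terminated UTF-8 string, or -1 if absent.
 * Plain strchr() is enough because bytes below 0x80 only ever appear in
 * UTF-8 as complete ASCII characters. */
static long find_ascii(const char *utf8, char ascii_delim)
{
    const char *p;

    assert(utf8 != NULL);
    assert((unsigned char)ascii_delim < 0x80); /* ASCII delimiters only */

    p = strchr(utf8, ascii_delim);
    return p ? (long)(p - utf8) : -1L;
}
```

Note that the result is a byte offset, not a character index: in the string
"caf\xC3\xA9" "=ol\xC3\xA9" the '=' sits at byte offset 5 although it is the
fifth character.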

Of course, you'll need to understand that you can no longer:
1. Use any *unsafe* functions such as tolower() or toupper(), which work on
single bytes and may corrupt bytes of 128 and above.
2. Rely on the fact that 1 byte = 1 char for random character access (e.g.
myString[5]) or string memory allocation, as a single character can occupy
multiple bytes.
3. Do string sorting byte by byte, as it will provide funky results if the
strings contain non-ASCII characters.
4. Rely on string compare, as it may be unreliable due to the various Unicode
normalization forms (the same text can have more than one byte sequence).
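
Points 1 and 2 above can be sketched in a few lines of C. The helper names
utf8_count and ascii_lower are hypothetical, and both assume valid,
null-terminated UTF-8; this is an illustration, not a substitute for a real
Unicode library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts code points by skipping continuation bytes,
 * which always have the bit pattern 10xxxxxx in UTF-8. */
static size_t utf8_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

/* Hypothetical helper: lowercases only the ASCII letters A-Z in place and
 * leaves every byte of 0x80 and above untouched, so multi-byte characters
 * are never corrupted (but non-ASCII letters are not lowercased either). */
static void ascii_lower(char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (*s >= 'A' && *s <= 'Z')
            *s = (char)(*s - 'A' + 'a');
    }
}
```

For the string "\xC3\x89T\xC3\x89" (three characters), strlen() reports 5
bytes while utf8_count() reports 3 code points. Point 4 shows up as soon as
you compare "\xC3\xA9" (precomposed e-acute) with "e\xCC\x81" (e followed by
a combining acute accent): both display the same, but strcmp() says they
differ, and the second one counts as 2 code points.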

Note that this is only a quick rundown of potential issues. So yes, C can
handle UTF-8 just fine, but there is a high potential for doing *wrong*
things (you may argue that this is a feature of C, but I won't go there...
:). The difficulty of adding UTF-8 support will depend on what you are doing
with all your char*. If you do a lot of string manipulation, it may be a
good time for you to either use the Unicode Windows APIs, as you are on NT,
use the free ICU, or use a commercially available Unicode toolkit.

Some good sources of information for you may be:
http://www.unicode.org for all the information on Unicode
http://oss.software.ibm.com/icu/  for information on ICU (for a C wrapper
look at http://www.xnetinc.com/xiua/).

I would also recommend reading "Adding internationalization support to the
base standard for JavaScript" by Richard Gillam which is a good case study
on adding Unicode support to legacy code.

Finally, you can have a look at my site below to get plenty more links :).

Thierry Sourbier
www.i18ngurus.com - Open internationalization resources directory.

----- Original Message -----
From: "souravm" <souravm@infy.com>
To: <www-international@w3.org>
Sent: Monday, August 20, 2001 8:12 AM
Subject: Unicode support for C/C++

> Hi All,
> I've a software written in C in Windows NT platform. I want to upgrade
> it for Unicode support. I got this information from net that - C can
> also handle Unicode by using UTF-8 as the multi-byte encoding in the
> char * type. I want to know how exactly it can be implemented.
> Thanks in advance,
> Sourav
Received on Monday, 20 August 2001 03:17:50 UTC