RE: UTF8 vs. Unicode (UTF16) in code from Yves Arrouye on 2001-03-14 (www-international@w3.org from January to March 2001)

From: Yves Arrouye <yves@realnames.com>
Date: Wed, 14 Mar 2001 13:47:40 -0800
To: "'Allan Chau'" <achau@rsasecurity.com>, www-international@w3.org
Message-ID: <B7233BD6980AD411875700508B5BD5D2011E8D33@elporto.internal.realnames.com>

> We're choosing between using UTF8 within the code vs. changing our
> code to use wide characters. I'm wondering what experiences others have
> had that can help with our decision.  I'm thinking that using UTF8
> internally may mean less rewriting initially, but we'd have to check
> carefully for code that makes assumptions about character boundaries.
> Because of this, I think that it'd be more complicated for developers to
> have to work with UTF8 in code.  Unicode (UTF16) internally would be
> easier to manage since most characters will essentially be fixed width,
> but there'd be a lot of code to rewrite.  Also, I've heard of problems
> with the wide character type (wchar_t) having different definitions
> depending on platform (we're running on NT & Sun Solaris).  Many of our
> product APIs would also be affected.

You will also soon have to worry about character boundaries in UTF-16,
because Unicode 3.1 encodes characters outside plane 0 (the BMP), and each
of those takes two 16-bit code units (a surrogate pair) in UTF-16. A number
of threads on the Unicode mailing lists show that you just cannot ignore
them, as some of them are very frequently used characters in certain
countries.
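
To make the boundary issue concrete, here is a minimal sketch of decoding
UTF-16 code units into code points. The function names are made up for this
example and do not come from any library, and replacing an unpaired surrogate
with U+FFFD is just one possible policy. It assumes a compiler that provides
<cstdint>.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helpers: classify a 16-bit code unit.
inline bool is_high_surrogate(std::uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low_surrogate(std::uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Decode UTF-16 code units into Unicode code points.
// A high surrogate followed by a low surrogate yields one code point
// above U+FFFF; an unpaired surrogate is replaced by U+FFFD (an assumption
// of this sketch, not a requirement).
std::vector<std::uint32_t> decode_utf16(const std::vector<std::uint16_t>& units) {
    std::vector<std::uint32_t> out;
    for (std::size_t i = 0; i < units.size(); ++i) {
        const std::uint16_t u = units[i];
        if (is_high_surrogate(u) && i + 1 < units.size() &&
            is_low_surrogate(units[i + 1])) {
            const std::uint32_t hi = u - 0xD800u;
            const std::uint32_t lo = units[i + 1] - 0xDC00u;
            out.push_back(0x10000u + (hi << 10) + lo);
            ++i;  // the pair consumed two code units
        } else if (is_high_surrogate(u) || is_low_surrogate(u)) {
            out.push_back(0xFFFDu);  // unpaired surrogate
        } else {
            out.push_back(u);  // plane 0 character: one code unit
        }
    }
    return out;
}
```

For instance, U+20000 (a CJK ideograph outside plane 0) is stored as the two
code units 0xD840 0xDC00, so code that indexes a UTF-16 buffer by code unit
can land in the middle of a character, just as it can in UTF-8.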

The quick solution for wchar_t's lack of portability is to not use
wchar_t. Since you are running NT and Solaris, I would recommend that you
use a cross-platform library like ICU, an open source Unicode library used
in projects such as Apache's Xerces XML parser. You can get more
information about ICU at the Internet Keyword: International Components for
Unicode
(http://oss.software.ibm.com/developerworks/opensource/icu/project/). This
will not only solve the issues with wchar_t but also guarantee that, should
you exchange data in legacy character sets, your conversion tables are the
same on all the platforms you use.
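
As an illustration of the wchar_t problem, here is a minimal sketch of the
idea behind not using it: define your own fixed-width code unit type. The
names utf16_unit, utf16_length and wchar_bits are hypothetical, chosen for
this example; ICU provides its own 16-bit type (UChar) and string functions,
which you would use instead of rolling your own. wchar_t is typically 16 bits
on NT and 32 bits on Solaris, so treat the sizes mentioned in the comments as
typical values rather than guarantees.

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>

// Hypothetical fixed-width UTF-16 code unit: the same size on every
// platform, unlike wchar_t.
typedef std::uint16_t utf16_unit;

static_assert(sizeof(utf16_unit) * CHAR_BIT == 16,
              "utf16_unit must be exactly 16 bits wide");

// Length in code units (not characters) of a zero-terminated UTF-16 string.
std::size_t utf16_length(const utf16_unit* s) {
    std::size_t n = 0;
    while (s[n] != 0) {
        ++n;
    }
    return n;
}

// Width of wchar_t in bits on the platform compiling this code
// (typically 16 on Windows NT and 32 on Solaris).
inline std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}
```

Data structures and APIs declared in terms of such a type keep the same
layout on both platforms, whereas anything declared with wchar_t changes size
between them.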

YA

Received on Wednesday, 14 March 2001 16:48:04 UTC