RE: Free C implementation of form C from Carl W. Brown on 2001-08-12 (www-international@w3.org from July to September 2001)

From: Carl W. Brown <cbrown@xnetinc.com>
Date: Sun, 12 Aug 2001 16:01:43 -0700
To: "Martin Duerst" <duerst@w3.org>, "Bjoern Hoehrmann" <derhoermi@gmx.net>, <www-international@w3.org>
Message-ID: <FNEHIHOMIIDPDCIFEJEGEEDPCIAA.cbrown@xnetinc.com>

Martin,

They now call it International Components for Unicode.
http://oss.software.ibm.com/icu/  Good point about supporting Unicode 3.1.
All new Unicode implementations should.

The problem with ICU is that it is not small.  I have been working with it
for a year and a half now, and it is a great product.  But if you are going
to use the unorm and normalizer code, then you also need the uchar code for
character properties.  To load the tables you need udata and resbund, and
of course everything needs putil, etc.  Even if you pull out the code you
don't need, you still have large Unicode character property tables.  By the
time you are through, you won't have a small piece of code.

I think the best approach, considering that he only needs the ICU common
routine DSO/DLL and the data DLL, is to ship them with the precompiled
code.  People wishing to compile from source will have to install ICU
themselves, at least on most Unix platforms.

I like ICU not only because it is open source but because it is the best
product available on the market.  In fact I am so impressed that I have
dedicated more than 5 man-months of work to contributing internal code to
ICU and creating open source code to help people migrate to Unicode using
ICU.  I hope that this will help people move to Unicode.  My code (xIUA,
http://www.xnetinc.com/xiua/) is not for all software, and this product
would probably not benefit from xIUA.  But it would benefit from ICU.  In
addition to normalization it could probably use the conversion routines.
Working with HTML, I expect that Tidy will have to deal with UTF-8 and
various code pages.  You can open a converter for the specific code page
that the user specifies.  You can also look at the xIUA code as an example
of how easy it is to get ICU to return the MIME name for the code page you
are using with standard ICU calls.  This way, if the user specifies a valid
but non-standard code page name, you can convert it to the MIME standard
name.  For example, "cp1252" would become "windows-1252".
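
To make the alias idea concrete, here is a minimal sketch in plain ANSI C.
It does not call ICU at all: the table, the struct and the function names
(alias_table, names_equal, mime_charset_name) are hypothetical stand-ins I
made up for illustration, and the alias list is a tiny hand-written sample,
not ICU's data.  In ICU itself the lookup is done against its own alias
tables; I believe the relevant call in current releases is
ucnv_getStandardName() with "MIME" as the standard, but check the ICU
headers for your version before relying on that.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical alias table: a small hand-written sample, NOT ICU's data. */
struct charset_alias {
    const char *alias;      /* name the user might specify */
    const char *mime_name;  /* preferred MIME charset name */
};

static const struct charset_alias alias_table[] = {
    { "cp1252",       "windows-1252" },
    { "windows-1252", "windows-1252" },
    { "latin1",       "ISO-8859-1"   },
    { "iso-8859-1",   "ISO-8859-1"   },
    { "utf8",         "UTF-8"        },
    { "utf-8",        "UTF-8"        }
};

/* Compare two ASCII names ignoring case; returns 1 if equal, else 0. */
static int names_equal(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

/* Return the MIME name for a known alias, or NULL if it is not listed. */
const char *mime_charset_name(const char *name)
{
    size_t i;

    if (name == NULL)
        return NULL;
    for (i = 0; i < sizeof alias_table / sizeof alias_table[0]; i++) {
        if (names_equal(name, alias_table[i].alias))
            return alias_table[i].mime_name;
    }
    return NULL;
}
```

Calling mime_charset_name("cp1252") returns "windows-1252", and a name that
is not in the table returns NULL so the caller can report it to the user.
A real program would get the alias data from ICU rather than maintain a
list like this by hand.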

Carl

> -----Original Message-----
> From: www-international-request@w3.org
> [mailto:www-international-request@w3.org]On Behalf Of Martin Duerst
> Sent: Saturday, August 11, 2001 1:23 AM
> To: Bjoern Hoehrmann; www-international@w3.org
> Subject: Re: Free C implementation of form C
>
>
> Bjoern - Please check out ICU (IBM Classes for Unicode, I guess).
>
> And please make that Unicode Version 3.1, there is a small but
> important bug fix in 3.1.
>
> Regards,   Martin.
>
> At 08:09 01/08/11 +0200, Bjoern Hoehrmann wrote:
> >Hi,
> >
> >    Is there any free and tiny ANSI C implementation of Unicode
> >Normalization Form C out there? I want to implement the Early
> >Uniform Normalization as in [1] in HTML Tidy [2] and such an
> >implementation would be very helpful. It should be based on
> >Unicode 3.0. It should come free-standing with optimised
> >Unicode data and hopefully act on either int[] or char*s UTF-8
> >encoded.
> >
> >[1] http://www.w3.org/TR/charmod/#sec-Normalization
> >[2] http://sourceforge.net/projects/tidy
> >
> >TIA,
> >--
> >Björn Höhrmann { mailto:bjoern@hoehrmann.de } http://www.bjoernsworld.de
> >am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de
> >25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/
>

Received on Sunday, 12 August 2001 19:01:47 UTC