Re: Rossette library for Unicode

From: Asmus Freytag <asmusf@ix.netcom.com>
Date: Wed, 06 Mar 2002 11:59:29 -0800
Message-Id: <4.2.0.58.20020306115739.01bf5008@popd.ix.netcom.com>
To: "souravm" <souravm@infy.com>, <www-international@w3.org>
At 08:28 AM 3/6/02 +0530, souravm wrote:
>Hi All,
>
>Can anyone who has worked with the Rosette library for handling Unicode
>characters clarify the following doubts?

I asked Tom Emerson, Senior Computational Linguist at BASIS, and he gave me 
the following answer:

---------------------------------------------------------------------

You can send these questions to unicode-support@basistech.com, which
could get a faster answer.

 >1. Rosette library defines a class bt_string for holding 8 bit strings. It
 >is possible to create a non-Unicode string from a Unicode string using the
 >ExternalEncoding class. The sample code is as follows -
 >
 >bt_string sjisHello("\u0065\u23ff", ExternalEncoding::ShiftJISMS);
 >
 >In the above code the Unicode string (the first argument to the constructor)
 >will be converted to Shift_JIS.
 >Now my question is: Shift_JIS uses multibyte characters, but bt_string
 >can hold only single-byte (8-bit) characters. So how does this
 >work?

In this case you need to think of bt_string as a container for octets,
not logical characters. In essence any multi-octet encoding (including
UTF-8) can be contained in a bt_string. So, to convert a Unicode
string to ShiftJIS, you would use:

Char16 my_ucs_2[] = { 0x3053, 0x306B, 0x3061, 0x308F, 0x0000 };
bt_string sjisHello(my_ucs_2, ExternalEncoding::ShiftJISMS);

Now sjisHello contains the ShiftJIS-encoded octets for the four
Unicode characters in my_ucs_2.

Going the other way, you could use

bt_wstring uniHello(sjisHello, ExternalEncoding::ShiftJISMS);
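
The "container for octets" idea can be sketched without Rosette. The
following is only an illustration under stated assumptions: std::string
stands in for bt_string, std::u16string for bt_wstring, and UTF-8 is used
instead of Shift_JIS because standard C++ has no portable Shift_JIS
converter. The function names ucs2_to_utf8 and utf8_to_ucs2 are
hypothetical and are not part of the Rosette API. Surrogate pairs and
malformed input are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Encode zero-terminated UCS-2 (BMP only) code units as UTF-8 octets.
// std::string is used purely as a byte container, as a stand-in for bt_string.
std::string ucs2_to_utf8(const char16_t* s) {
    std::string out;
    for (; *s != 0; ++s) {
        unsigned c = *s;
        assert(c < 0xD800 || c > 0xDFFF);  // surrogates are out of scope here
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else if (c < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xE0 | (c >> 12)));
            out.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}

// Decode well-formed UTF-8 of at most three octets per character (BMP only).
std::u16string utf8_to_ucs2(const std::string& bytes) {
    std::u16string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t len = b0 < 0x80 ? 1 : (b0 & 0xE0) == 0xC0 ? 2 : 3;
        assert((len < 3 || (b0 & 0xF0) == 0xE0) && i + len <= bytes.size());
        unsigned c = len == 1 ? b0 : len == 2 ? (b0 & 0x1F) : (b0 & 0x0F);
        for (std::size_t k = 1; k < len; ++k) {
            c = (c << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);
        }
        out.push_back(static_cast<char16_t>(c));
        i += len;
    }
    return out;
}
```

Passing the four code units from my_ucs_2 to ucs2_to_utf8 yields a
12-octet string: four logical characters, three octets each. The byte
count and the character count differ, which is the point of treating the
8-bit string as an octet container.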

 >2. Is the bt_string class different from a normal C character array?
 >In both cases single-byte characters are supported.

Yes, bt_string is different than a regular C character array because
there are no (within the limits of your machine) bounds on the size of
the string. You can append characters/strings to it and the underlying
storage will grow to fit. Internally bt_string is implemented in terms
of the C char (or probably unsigned char, though I don't remember
right now) type.
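
The difference between a fixed C array and a growable string class can be
sketched as follows. This is an illustration only, assuming std::string
behaves like bt_string with respect to growth; append_fixed and
append_growable are hypothetical helper names, not Rosette functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Append to a fixed-size C buffer. The caller must track the capacity;
// the function refuses the append if the result would not fit.
bool append_fixed(char* buf, std::size_t capacity, const char* piece) {
    assert(buf != nullptr && piece != nullptr);
    std::size_t used = std::strlen(buf);
    std::size_t extra = std::strlen(piece);
    if (used + extra + 1 > capacity) {
        return false;  // no room for the new text plus the terminating zero
    }
    std::memcpy(buf + used, piece, extra + 1);  // copies the terminator too
    return true;
}

// Append to a string class. The underlying storage grows to fit,
// so there is no capacity for the caller to manage.
void append_growable(std::string& s, const char* piece) {
    s += piece;
}
```

With an 8-byte buffer holding "abc", appending "def" succeeds but a
further "gh" is rejected, while the same two appends on a std::string
simply produce "abcdefgh".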

Hope that helps,

     -tree

--
Tom Emerson                                          Basis Technology Corp.
Sr. Computational Linguist                         http://www.basistech.com
   "Beware the lollipop of mediocrity: lick it once and you suck forever" 
Received on Wednesday, 6 March 2002 14:58:55 GMT
