Re: Migrating legacy multilingual data to utf-8 from Tex Texin on 2005-02-23 (www-international@w3.org from January to March 2005)

From: Tex Texin <tex@xencraft.com>
Date: Wed, 23 Feb 2005 09:39:57 -0800
To: Deborah Cawkwell <deborah.cawkwell@bbc.co.uk>
CC: www-international@w3.org
Message-ID: <421CBFED.EF9E0F28@xencraft.com>

Hi Deborah,

You will need to assess whether the encoding labels are correct. Fonts
often mask imprecise encoding labels, so text that displays correctly
may still be mislabeled.
For example, cp936 data is often mislabeled as gb2312. Many of the
Windows encodings are supersets of other standard encodings. If the
conversion between gb2312 and Unicode is performed instead of the one
between cp936 and Unicode, then the additional characters will not
transcode properly.
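
To make the mislabeling problem concrete, here is a minimal Python
sketch. It assumes Python's built-in "cp936" and "gb2312" codecs; the
sample string and the helper name fits_label are mine, for illustration
only.

```python
# Sketch: what happens when cp936 data is decoded under a gb2312 label.
# U+4E2D is in both character sets; U+570B (a traditional form) is in
# cp936/GBK but not in GB2312.
text = "\u4e2d\u570b"

raw = text.encode("cp936")  # bytes as they might sit in the legacy database

right = raw.decode("cp936")                     # correct label: round-trips
wrong = raw.decode("gb2312", errors="replace")  # wrong label: data is damaged


def fits_label(data: bytes, label: str) -> bool:
    """Return True if the bytes decode cleanly under the given label.

    Hypothetical helper. A clean decode does not prove the label is right,
    but a failure proves it is wrong for this data.
    """
    try:
        data.decode(label)
        return True
    except UnicodeDecodeError:
        return False


print(fits_label(raw, "cp936"), fits_label(raw, "gb2312"))
```

A check like this only catches rows that actually use the extra
characters. Rows that stay within the gb2312 repertoire decode the same
way under either label.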

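When you move to UTF-8, fixed-width fields sized in bytes will grow
considerably. UTF-8 takes up to 4 bytes per character: 1 for ASCII, 2
for most accented Latin letters (e.g. Czech), and 3 for most Chinese
characters. Instead, use the variable-width character datatypes. Then
you only use as much storage as is needed.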
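
A quick way to see the growth is to count bytes per character before
and after conversion. This is a small Python sketch; the sample strings
and the helper name byte_widths are illustrative assumptions, not taken
from your data.

```python
def byte_widths(text: str, encoding: str) -> list[int]:
    """Number of bytes each character occupies in the given encoding."""
    return [len(ch.encode(encoding)) for ch in text]


# Czech sample spelled with escapes so the code points are unambiguous:
# P, r-caron, i-acute, l, i, s-caron
czech = "P\u0159\u00edli\u0161"
chinese = "\u4e2d\u6587"      # two common Chinese characters
rare = "\U00020000"           # a character outside the Basic Multilingual Plane

print(byte_widths(czech, "cp1250"))    # legacy Windows encoding for Czech
print(byte_widths(czech, "utf-8"))
print(byte_widths(chinese, "gb2312"))
print(byte_widths(chinese, "utf-8"))
print(byte_widths(rare, "utf-8"))
```

The accented Czech letters go from 1 byte to 2, and the Chinese
characters from 2 bytes to 3. Only characters beyond the Basic
Multilingual Plane need the full 4 bytes.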

You should also consider UTF-16. Depending on the script distribution of
your data, it may be more efficient for storage and/or performance (for
example, most Chinese characters take 2 bytes in UTF-16 but 3 in UTF-8).
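
If you want to measure this on a sample of your own data, something
like the following Python sketch would do. The sample texts and the
function name encoded_sizes are placeholders I made up; "utf-16-le" is
used so that no byte order mark is counted.

```python
def encoded_sizes(samples: dict[str, str]) -> dict[str, dict[str, int]]:
    """Total encoded size in bytes of each sample under UTF-8 and UTF-16."""
    encodings = ("utf-8", "utf-16-le")
    return {
        name: {enc: len(text.encode(enc)) for enc in encodings}
        for name, text in samples.items()
    }


# Hypothetical samples; substitute representative rows from the database.
samples = {
    "english": "Hello, world",                  # 12 ASCII characters
    "chinese": "\u4e2d\u6587\u6570\u636e\u5e93",  # 5 Chinese characters
}

sizes = encoded_sizes(samples)
for name, by_encoding in sizes.items():
    print(name, by_encoding)
```

ASCII-heavy text doubles in size under UTF-16, while Chinese text is
about a third smaller than in UTF-8, so the answer depends on the mix.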

If you are moving your multilingual data to Unicode to standardize the
representation, and so that you can use database indexes meaningfully
on it, then you may need to consider multiple indexes by language.
For example, you may want one index for French users and another index
on the identical fields for Swedish users, so that any specification of
ranges (e.g. g < x and x < y) is correct for the language, and also so
the records are sorted correctly by language.
If the database is frequently updated, multiple indexes will require
multiple index updates per write.
On the other hand, if the data is mostly read and infrequently updated,
then the cost of the additional indexes is not great.
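
To illustrate why the same range test can give different answers per
language, here is a deliberately simplified Python sketch. The two key
functions are toy approximations I wrote for this example (Swedish
placing a-ring, a-umlaut and o-umlaut after z; a French-like ordering
that ignores accents at the first level). They are not real collation
implementations, and a real system should use the database's or a
library's collation support.

```python
import unicodedata

# Swedish alphabet order: the three extra letters sort after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"


def french_like_key(word: str) -> str:
    """Toy key: lowercase, then drop accents so o-umlaut sorts like o."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def swedish_like_key(word: str) -> list[int]:
    """Toy key: position of each letter in the Swedish alphabet."""
    composed = unicodedata.normalize("NFC", word.lower())
    return [
        SWEDISH_ALPHABET.index(c) if c in SWEDISH_ALPHABET else ord(c)
        for c in composed
    ]


def in_range(word, low, high, key):
    """Evaluate low < word < high under the given sort key."""
    return key(low) < key(word) < key(high)


words = ["g\u00e5s", "\u00f6l", "zebra", "hus"]

print(sorted(words, key=swedish_like_key))
print(sorted(words, key=french_like_key))
print(in_range(words[1], "g", "y", swedish_like_key))
print(in_range(words[1], "g", "y", french_like_key))
```

The word beginning with o-umlaut falls inside the range g..y under the
French-like ordering but outside it under the Swedish-like one, which
is why a single index cannot serve both groups of users correctly.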

I would not do an update in place. It is risky. Also, if you are
updating a large database, updating indexes on each write can make it
slow. I would dump the data as a text file, convert it to UTF-8, and
then load the data into an empty database.
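
The conversion step in the middle could look roughly like this Python
sketch. It assumes each dump file is in a single, known legacy
encoding; the function name convert_dump, the file names and the sample
text are all hypothetical. Decoding is strict on purpose, so a wrong
encoding label stops the run instead of silently damaging data.

```python
import os
import tempfile


def convert_dump(src_path, dst_path, src_encoding, chunk_chars=1 << 16):
    """Stream a text dump from src_encoding to UTF-8.

    Returns the number of characters converted. newline="" keeps the
    original line endings untouched.
    """
    total = 0
    with open(src_path, "r", encoding=src_encoding, errors="strict",
              newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        while True:
            chunk = src.read(chunk_chars)
            if not chunk:
                break
            dst.write(chunk)
            total += len(chunk)
    return total


# Small self-contained demonstration with a made-up Czech row.
sample = "1\tP\u0159\u00edli\u0161 \u017elu\u0165ou\u010dk\u00fd k\u016f\u0148\n"
workdir = tempfile.mkdtemp()
src_file = os.path.join(workdir, "dump_cp1250.txt")
dst_file = os.path.join(workdir, "dump_utf8.txt")

with open(src_file, "w", encoding="cp1250", newline="") as f:
    f.write(sample)

count = convert_dump(src_file, dst_file, "cp1250")

with open(dst_file, "r", encoding="utf-8", newline="") as f:
    converted = f.read()

print(count, converted == sample)
```

If rows in one table use different encodings, you would need to split
the dump by encoding (or by language) first and convert each part with
its own label.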

Make sure your trigger procedures support unicode.

hth
tex

Deborah Cawkwell wrote:
> 
> We have legacy multilingual data stored in a Postgres database.
> 
> In our database text is typically stored in the character encoding
> in which it was entered. This typically corresponds with the language of
> the text, e.g. Czech: windows-1250; Chinese: gb2312
> 
> We wish to take more advantage of Unicode, and hence we are considering
> migrating the data to UTF-8.
> 
> What is the best way to do this?
> 
> Any advice, experience would be welcome.

-- 
-------------------------------------------------------------
Tex Texin   cell: +1 781 789 1898   mailto:Tex@XenCraft.com
Xen Master                          http://www.i18nGuy.com

XenCraft		            http://www.XenCraft.com
Making e-Business Work Around the World
-------------------------------------------------------------

Received on Thursday, 24 February 2005 11:46:06 UTC