Re: Migrating legacy multilingual data to utf-8 from A. Vine on 2005-02-28 (www-international@w3.org from January to March 2005)

From: A. Vine <andrea.vine@Sun.COM>
Date: Mon, 28 Feb 2005 15:24:24 -0800
To: Deborah Cawkwell <deborah.cawkwell@bbc.co.uk>
Cc: www-international@w3.org
Message-id: <4223A828.7050401@sun.com>
One more thing to watch out for: be sure that all the data within a
professed encoding is actually in that encoding.  For example, we had
EUC-JP in one of our ISO-8859-1 databases, and it worked only because
everything was passing the bytes through unchanged and the browser
interface used Japanese Auto-detect.  When the data were converted to
UTF-8, the EUC-JP data got mangled, but it still looked OK because the
extracting program converted it back into ISO-8859-1, didn't label it,
and the Auto-detect did its magic.  But the UTF-8 data we put into the
database didn't work.  We had to find all the EUC-JP data and convert it
separately, then change the extracting program.
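
The failure mode above can be reproduced in a few lines. This is a minimal
sketch in Python, not the tooling used in the migration described here. The
sample string and the helper name `looks_like_euc_jp` are made up for
illustration, and the heuristic is an assumption: short ISO-8859-1 strings
can happen to be valid EUC-JP byte sequences, so hits need a human check.

```python
# Japanese text stored as raw EUC-JP bytes in a column labelled ISO-8859-1.
original = "日本語"
stored = original.encode("euc_jp")

# A label-trusting migration decodes as ISO-8859-1. This never fails,
# because every byte value is a valid ISO-8859-1 character.
migrated = stored.decode("iso-8859-1").encode("utf-8")

# The UTF-8 data now holds accented Latin characters, not Japanese.
mangled_text = migrated.decode("utf-8")

# An extractor that converts back to ISO-8859-1 restores the original bytes,
# so an auto-detecting browser still shows Japanese and hides the damage.
round_trip = mangled_text.encode("iso-8859-1")


def looks_like_euc_jp(raw: bytes) -> bool:
    """Heuristic (assumption): non-ASCII bytes that decode cleanly as EUC-JP."""
    if raw.isascii():
        return False
    try:
        raw.decode("euc_jp")
    except UnicodeDecodeError:
        return False
    return True
```

Running a check like `looks_like_euc_jp` over the raw column bytes before the
conversion is one way to shortlist rows that need to be converted separately.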

Tex Texin wrote:
> Hi Deborah,
> 
> You will need to assess whether the encoding labels are correct. Often
> fonts correct for imprecise encoding labels.
> For example, cp936 being mislabeled as gb2312. Many of the windows
> encodings are supersets of other standard encodings. If the conversion
> between gb2312 and unicode is performed instead of the one between cp936
> and unicode, then the additional characters will not transcode properly.
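
To illustrate the cp936/gb2312 point, here is a small Python sketch using the
codecs that ship with Python. The sample text and the helper name
`decode_strict` are assumptions chosen for the example. A strict decode fails
visibly when the label is narrower than the data; a lenient conversion would
lose the extra characters instead.

```python
# "們" is a traditional-form character that cp936 (GBK) can encode
# but that is not part of GB2312.
text = "我們"
cp936_bytes = text.encode("cp936")


def decode_strict(raw: bytes, label: str):
    """Return the decoded text, or None if the bytes do not fit the label."""
    try:
        return raw.decode(label)
    except UnicodeDecodeError:
        return None


as_cp936 = decode_strict(cp936_bytes, "cp936")    # round-trips correctly
as_gb2312 = decode_strict(cp936_bytes, "gb2312")  # None: label is too narrow
```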
> 
> When you move to utf-8 the fixed width fields will grow considerably.
> UTF-8 can take up to 4 bytes per character. Instead use the variable
> width character datatypes. Then you only use as much storage as is needed.
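
A quick way to see how much a field can grow is to compare byte counts before
and after conversion. The sketch below is in Python; the sample values and
the function name `growth` are made up for illustration. UTF-8 uses between
1 and 4 bytes per character, so the growth depends on the script.

```python
samples = {
    "windows-1250": "žluťoučký",  # Czech
    "gb2312": "中文数据",          # Chinese
}


def growth(text: str, legacy_encoding: str):
    """Return (characters, legacy bytes, UTF-8 bytes) for one value."""
    return (
        len(text),
        len(text.encode(legacy_encoding)),
        len(text.encode("utf-8")),
    )


report = {enc: growth(text, enc) for enc, text in samples.items()}
# The Chinese sample is 4 characters: 8 bytes in gb2312, 12 bytes in UTF-8.
```

A field sized in bytes for the legacy encoding can therefore overflow after
conversion even though the number of characters has not changed.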
> 
> You should also consider utf-16. Depending on the script distribution of
> your data it may be more efficient for storage and/or performance.
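
The storage trade-off between utf-8 and utf-16 can be measured on sample data
rather than guessed. A minimal Python sketch follows; the function name
`encoded_sizes` and the sample strings are assumptions, and real figures
depend on the mix of scripts in the actual database.

```python
def encoded_sizes(text: str) -> dict:
    """Byte counts for the same text in UTF-8 and UTF-16 (without a BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


chinese = encoded_sizes("中文数据")  # 12 bytes in UTF-8, 8 in UTF-16
english = encoded_sizes("data")      # 4 bytes in UTF-8, 8 in UTF-16
```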
> 
> If you are moving your multilingual data to unicode to standardize the
> representation and so you can use database indexes meaningfully on it,
> then you may need to consider multiple indexes by language.
> For example, you may want one index for French users and another index
> on the identical fields for Swedish users, so that any specification of
> ranges (e.g. g < x and x < y ) are correct for the language, and also so
> the records are sorted correctly by language.
> If the database is frequently updated multiple indexes will require
> multiple updates per write.
> On the other hand, if the data is mostly accessed and infrequently
> updated then the cost for additional indexes is not great.
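
Why one index cannot serve both languages can be shown with a toy example.
The Python sketch below is a deliberate simplification, not real collation:
the two key functions are hypothetical stand-ins for what a database
collation or a collation library would provide. It only models that Swedish
treats å, ä, ö as separate letters after z, while French-style ordering
treats accents as a secondary difference.

```python
import unicodedata

SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
SWEDISH_RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}


def swedish_key(word: str):
    """Rank å, ä, ö as distinct letters after z (simplified)."""
    normalized = unicodedata.normalize("NFC", word.lower())
    return [SWEDISH_RANK.get(ch, len(SWEDISH_RANK)) for ch in normalized]


def french_key(word: str):
    """Compare base letters first, accents only as a tie-breaker (simplified)."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base, word)


words = ["zebra", "äpple", "apple", "örn"]
swedish_order = sorted(words, key=swedish_key)
french_order = sorted(words, key=french_key)
```

The same four values come out in a different order under each key, which is
why a range query or a sorted listing needs an index built for the user's
language.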
> 
> I would not do an update in place. It is risky. Also if you are updating
> a large database, updating indexes on each write can make it slow. I
> would dump the data as a text file, convert it to utf-8, and then load
> the data into an empty database.
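
The convert step of a dump, convert, load approach can be sketched as below.
This is a minimal Python example under stated assumptions: the function name
`convert_dump` is hypothetical, the dump is assumed to be a plain text file
in a single known encoding, and producing the dump and loading the result
are left to the database's own tools.

```python
from pathlib import Path


def convert_dump(src: Path, dst: Path, source_encoding: str) -> int:
    """Re-encode a text dump to UTF-8 and return the number of lines written.

    Decoding is strict, so bytes that do not fit the label raise
    UnicodeDecodeError instead of being silently replaced.
    """
    count = 0
    # newline="" keeps the original line endings untouched.
    with open(src, "r", encoding=source_encoding, errors="strict", newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        for line in fin:
            fout.write(line)
            count += 1
    return count
```

Because the original file is never modified, a failed or partial conversion
can simply be rerun after the offending rows have been dealt with.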
> 
> Make sure your trigger procedures support unicode.
> 
> hth
> tex
> 
> Deborah Cawkwell wrote:
> 
>>We have legacy multilingual data stored in a Postgres database.
>>
>>In our database text is typically stored in the character encoding
>>in which it was entered. This typically corresponds with the language of
>>the text, e.g. Czech: windows-1250; Chinese: gb2312
>>
>>We wish to take more advantage of Unicode, and hence we are considering
>>migrating the data to UTF-8.
>>
>>What is the best way to do this?
>>
>>Any advice, experience would be welcome.
>>
> 
>
Received on Monday, 28 February 2005 23:19:31 UTC