
Re: Migrating legacy multilingual data to utf-8

From: Tex Texin <tex@xencraft.com>
Date: Wed, 23 Feb 2005 09:39:57 -0800
Message-ID: <421CBFED.EF9E0F28@xencraft.com>
To: Deborah Cawkwell <deborah.cawkwell@bbc.co.uk>
CC: www-international@w3.org

Hi Deborah,

You will need to assess whether the encoding labels are correct. Fonts
often mask imprecise encoding labels; a common example is cp936 data
mislabeled as gb2312. Many of the Windows encodings are supersets of
other standard encodings, so if the conversion between gb2312 and
unicode is performed instead of the one between cp936 and unicode, the
characters that exist only in the superset will not transcode properly.
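To illustrate with a minimal Python sketch (the sample text is my own;
the traditional character 們 is in cp936/GBK but outside the gb2312
repertoire): bytes produced under the superset encoding cannot be
transcoded if you trust a gb2312 mislabel.

```python
# Legacy bytes that are really cp936/GBK: 他 is in gb2312,
# but the traditional form 們 exists only in the GBK superset.
raw = "他們".encode("gbk")

# Transcoding with the correct (superset) label round-trips fine.
assert raw.decode("gbk") == "他們"

# Trusting the gb2312 mislabel fails on the GBK-only character.
try:
    raw.decode("gb2312")
except UnicodeDecodeError:
    print("gb2312 cannot decode the GBK-only bytes")
```

The same applies to cp1252 vs iso-8859-1, cp1250 vs iso-8859-2, and so
on: audit a sample of each table before choosing the source codec.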

When you move to utf-8, fixed width fields will grow considerably:
UTF-8 takes up to 4 bytes per character. Instead, use the variable
width character datatypes, so you only use as much storage as is
needed.
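A quick Python sketch of why (sample strings are my own): the UTF-8
byte length of a string can be up to four times its character count, so
fields sized as "N characters = N bytes" no longer hold.

```python
# Character count vs UTF-8 byte count for a few scripts.
samples = ["abc", "Čeština", "中文数据", "𝄞"]  # Latin, Czech, Chinese, supplementary plane
for s in samples:
    encoded = s.encode("utf-8")
    print(f"{s!r}: {len(s)} chars -> {len(encoded)} UTF-8 bytes")
```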

You should also consider utf-16. Depending on the script distribution of
your data it may be more efficient for storage and/or performance.
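A rough comparison in Python (illustrative strings of my own): Latin
text is smaller in UTF-8, while BMP CJK text is smaller in UTF-16 (2
bytes per character instead of 3).

```python
# Compare encoded sizes for a Latin-heavy and a CJK-heavy sample.
latin = "the quick brown fox" * 10
chinese = "敏捷的棕色狐狸" * 10
for name, text in [("latin", latin), ("chinese", chinese)]:
    u8 = len(text.encode("utf-8"))
    u16 = len(text.encode("utf-16-le"))
    print(f"{name}: utf-8={u8} bytes, utf-16={u16} bytes")
```

If most of your data is Chinese, Japanese, or Korean, UTF-16 can cut
storage by roughly a third relative to UTF-8; for mostly-Latin data the
reverse holds.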

If you are moving your multilingual data to unicode to standardize the
representation and so you can use database indexes meaningfully on it,
then you may need to consider multiple indexes by language.
For example, you may want one index for French users and another index
on the identical fields for Swedish users, so that any specification of
ranges (e.g. a < x and x < b) is correct for the language, and also so
the records are sorted correctly for each language.
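A toy Python sketch of the sorting difference (the collation orders
below are simplified stand-ins I wrote for illustration, not real
collation tables; a real system would use the database's or ICU's
locale collations): Swedish sorts ö after z, while French sorts it with
o, so the same rows come back in different orders.

```python
# Simplified per-language alphabets (illustrative, not full collations).
swedish_order = "abcdefghijklmnopqrstuvwxyzåäö"   # ö sorts after z
french_order = "abcdefghijklmnoöpqrstuvwxyz"      # ö sorts with o

def sort_key(word, order):
    # Map each character to its position in the language's alphabet.
    return [order.index(c) for c in word]

words = ["möta", "mota", "mzta"]
by_swedish = sorted(words, key=lambda w: sort_key(w, swedish_order))
by_french = sorted(words, key=lambda w: sort_key(w, french_order))
print("Swedish:", by_swedish)
print("French: ", by_french)
```

Since the two orderings disagree, a single index cannot serve both sets
of users for range scans or ordered retrieval.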
If the database is frequently updated multiple indexes will require
multiple updates per write.
On the other hand, if the data is mostly accessed and infrequently
updated then the cost for additional indexes is not great.

I would not do an update in place; it is risky. Also, if you are
updating a large database, updating indexes on each write can make it
slow. I would dump the data as a text file, convert it to utf-8, and
then load the data into an empty database.
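The conversion step can be as simple as this Python sketch (the
function and file names are hypothetical; you would run it once per
dump file, passing the legacy encoding that file was actually stored
in):

```python
def convert_dump(src_path, dst_path, src_encoding):
    """Re-encode a text dump from a known legacy encoding to UTF-8."""
    with open(src_path, "r", encoding=src_encoding) as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(line)

# Hypothetical usage: a dump of the Czech table, stored as windows-1250.
# convert_dump("czech_dump.txt", "czech_dump.utf8.txt", "cp1250")
```

Opening the source in strict mode (the default) is deliberate: a
UnicodeDecodeError during conversion tells you the encoding label was
wrong before any bad data reaches the new database.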

Make sure your trigger procedures support unicode.


Deborah Cawkwell wrote:
> We have legacy multilingual data stored in a Postgres database.
> In our database text is typically stored in the character encoding
> in which it was entered. This typically corresponds with the language of
> the text, e.g. Czech: windows-1250; Chinese: gb2312
> We wish to take more advantage of Unicode, and hence we are considering
> migrating the data to UTF-8.
> What is the best way to do this?
> Any advice, experience would be welcome.
> http://www.bbc.co.uk/

Tex Texin   cell: +1 781 789 1898   mailto:Tex@XenCraft.com
Xen Master                          http://www.i18nGuy.com
XenCraft		            http://www.XenCraft.com
Making e-Business Work Around the World
Received on Thursday, 24 February 2005 11:46:06 UTC
