- From: A. Vine <andrea.vine@Sun.COM>
- Date: Mon, 28 Feb 2005 15:24:24 -0800
- To: Deborah Cawkwell <deborah.cawkwell@bbc.co.uk>
- Cc: www-international@w3.org
One more thing to watch out for: be sure that all the data within a professed
encoding is actually in that encoding. For example, we had EUC-JP in one of
our ISO-8859-1 databases, and it only worked because everything passed the raw
bytes straight through and the browser interface used Japanese auto-detect.
When the data were converted to UTF-8, the EUC-JP data got mangled, but it
still looked OK because the extracting program converted it back into
ISO-8859-1, didn't label it, and the auto-detect did its magic. But the UTF-8
data we put into the database didn't work. We had to find all the EUC-JP data
and convert it separately, then change the extracting program. (Illustrative
sketches of these points follow the quoted thread below.)

Tex Texin wrote:
> Hi Deborah,
>
> You will need to assess whether the encoding labels are correct. Often
> fonts compensate for imprecise encoding labels.
> For example, cp936 being mislabeled as gb2312. Many of the Windows
> encodings are supersets of other standard encodings. If the conversion
> between gb2312 and Unicode is performed instead of the one between cp936
> and Unicode, then the additional characters will not transcode properly.
>
> When you move to UTF-8, fixed-width fields will grow considerably, since
> UTF-8 can take up to 4 bytes per character. Instead, use the variable-width
> character datatypes; then you only use as much storage as is needed.
>
> You should also consider UTF-16. Depending on the script distribution of
> your data, it may be more efficient for storage and/or performance.
>
> If you are moving your multilingual data to Unicode to standardize the
> representation and so you can use database indexes meaningfully on it,
> then you may need to consider multiple indexes by language.
> For example, you may want one index for French users and another index
> on the identical fields for Swedish users, so that any specification of
> ranges (e.g. g < x and x < y) is correct for the language, and also so
> the records are sorted correctly by language.
> If the database is frequently updated, multiple indexes will require
> multiple updates per write.
> On the other hand, if the data is mostly read and infrequently updated,
> then the cost of additional indexes is not great.
>
> I would not do an update in place. It is risky. Also, if you are updating
> a large database, updating indexes on each write can make it slow. I
> would dump the data as a text file, convert it to UTF-8, and then load
> the data into an empty database.
>
> Make sure your trigger procedures support Unicode.
>
> hth
> tex
>
> Deborah Cawkwell wrote:
>
>> We have legacy multilingual data stored in a Postgres database.
>>
>> In our database, text is typically stored in the character encoding
>> in which it was entered. This typically corresponds with the language of
>> the text, e.g. Czech: windows-1250; Chinese: gb2312.
>>
>> We wish to take more advantage of Unicode, and hence we are considering
>> migrating the data to UTF-8.
>>
>> What is the best way to do this?
>>
>> Any advice or experience would be welcome.
>>
>> http://www.bbc.co.uk/
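
A minimal Python sketch of the round-trip described above, assuming the real
data is EUC-JP sitting in a column the database believes is ISO-8859-1 (the
sample string is illustrative, not from our data):

    # EUC-JP bytes stored in a column labeled ISO-8859-1.
    euc_jp_bytes = "日本語".encode("euc-jp")

    # A naive migration decodes with the declared encoding, then re-encodes
    # as UTF-8, which mangles the text.
    mangled_utf8 = euc_jp_bytes.decode("iso-8859-1").encode("utf-8")

    # The old extractor converted output back to ISO-8859-1 without labelling
    # it, which restores the original EUC-JP bytes -- so the browser's
    # Japanese auto-detect still displayed it correctly and hid the problem.
    restored = mangled_utf8.decode("utf-8").encode("iso-8859-1")
    assert restored == euc_jp_bytes

    # Genuine UTF-8 rows written later break under that same extractor,
    # because re-encoding them as ISO-8859-1 does not yield valid EUC-JP.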
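
On Tex's cp936/gb2312 point, a small illustration: the character 镕 is in
cp936 (GBK) but outside the GB2312 repertoire, so converting with the
narrower table loses it (the specific character is my example, not from the
thread):

    ch = "镕"                        # present in cp936/GBK, absent from GB2312

    ch.encode("cp936")               # succeeds: the superset encoding covers it

    try:
        ch.encode("gb2312")          # fails: the narrower conversion table
    except UnicodeEncodeError as e:  # cannot represent the character
        print("gb2312 cannot encode it:", e)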
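
For the UTF-8 versus UTF-16 storage question, a quick way to decide is to
measure encoded lengths over a sample of your own data; the strings below are
only placeholders:

    samples = ["Český text", "中文文本", "Português", "English text"]
    for s in samples:
        print(f"{s!r}: utf-8 = {len(s.encode('utf-8'))} bytes, "
              f"utf-16 = {len(s.encode('utf-16-le'))} bytes")

Latin-script text tends to be smaller in UTF-8, CJK-heavy text smaller in
UTF-16.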
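
And a minimal sketch of the dump-convert-reload approach Tex recommends,
assuming each dump file's legacy encoding is known (file names and encodings
here are hypothetical; rows whose declared encoding is wrong, as in the
EUC-JP case above, have to be found and handled separately first):

    # Convert a windows-1250 text dump to UTF-8 before bulk-loading it into
    # the new UTF-8 database. The default errors="strict" makes mislabeled
    # rows fail loudly instead of being silently mangled.
    with open("czech_dump.txt", encoding="windows-1250") as src, \
         open("czech_dump.utf8.txt", "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(line)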
Received on Monday, 28 February 2005 23:19:31 UTC