Re: displaying Chinese and Thai characters

From: Addison Phillips [wM] <aphillips@webmethods.com>
Date: Wed, 06 Aug 2003 09:11:35 -0700
Message-ID: <3F3128B7.2080608@webmethods.com>
To: "Audrey Ng (by way of Martin Duerst <duerst@w3.org>)" <audrey@nxspace.com>
CC: www-international@w3.org

Hi Audrey,

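The problem is that you are retrieving the literal text "\u5000\u5001" 
(a backslash, a "u" and four hex digits, twice) and not the two 
characters that you are trying to represent by using an escape 
sequence. A properties file is converted when it is read in: the loader 
decodes each escape into the character it names (and ListResourceBundles 
are converted by javac to true Unicode sequences). Nothing performs that 
conversion on text that comes out of a database. Another way to say this 
is that, written as a Java string literal, you are really retrieving the 
string "\\u5000\\u5001"!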
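
To make the difference concrete, here is a minimal sketch using only the 
standard library. The class name, the field names and the "greeting" key are 
made up for illustration; they are not from Audrey's code.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class EscapeVsCharacter {
    // The text as it sits in a database column or a .properties file:
    // a backslash, the letter u and four hex digits, twice (12 characters).
    static final String ESCAPED_TEXT = "\\u5000\\u5001";

    // The same escapes written directly in Java source: javac turns them
    // into two real characters at compile time.
    static final String REAL_CHARACTERS = "\u5000\u5001";

    // Properties.load decodes backslash-u escapes while parsing, which is
    // why the properties-file route ends up with real characters.
    static String loadGreeting() {
        Properties props = new Properties();
        try {
            props.load(new StringReader("greeting=" + ESCAPED_TEXT));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("greeting");
    }

    public static void main(String[] args) {
        System.out.println(ESCAPED_TEXT.length());                  // 12
        System.out.println(REAL_CHARACTERS.length());               // 2
        System.out.println(loadGreeting().equals(REAL_CHARACTERS)); // true
    }
}
```

A String fetched through JDBC behaves like ESCAPED_TEXT here: no step in 
between decodes the escapes, so the JSP prints them as written.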

It's important to remember that java.lang.String objects are always 
Unicode internally. It is how you convert to/from external sources that 
matters. In the case of a database, though, you are retrieving String 
objects using JDBC. The conversion is done somewhere else, outside your 
control. Presumably you had to write some code to insert \u5000 (etc.) 
into your database instead of the character U+5000. You have to reverse 
that encoding procedure to retrieve the original character.
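
If the escaped text has to stay in the database for now, reversing the 
encoding could look roughly like the sketch below. It assumes the stored 
form uses exactly the backslash-u-plus-four-hex-digits convention; the 
class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeUnescaper {
    // Matches a literal backslash, the letter u and exactly four hex digits.
    private static final Pattern ESCAPE =
            Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

    // Replaces every escape in the stored text with the character it names
    // and leaves all other text untouched.
    static String unescape(String stored) {
        Matcher m = ESCAPE.matcher(stored);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps a decoded '$' or backslash from being
            // treated as a special character in the replacement string.
            m.appendReplacement(out,
                    Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Calling UnicodeUnescaper.unescape on the value returned by 
ResultSet.getString, before handing it to the JSP, would then yield the 
two Chinese characters instead of the twelve-character escape text.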

Recent mySQL versions (since 4.1) can use the UTF-8 (or UCS-2, which is 
essentially UTF-16 without surrogate pairs) encoding of Unicode. Then 
you just read/write String objects to/from the database, without 
worrying about encodings or messing with escape sequences. This is a far 
better choice, since it means that you can also access the data in the 
database directly.
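
With a Unicode-capable column, the JDBC side reduces to plain 
setString/getString calls. The following is only a sketch: the "messages" 
table, its columns and the class name are hypothetical, and the URL 
parameters assume the MySQL Connector/J driver, whose useUnicode and 
characterEncoding properties ask it to exchange text as UTF-8.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

class UnicodeColumnSketch {
    // Builds a Connector/J URL that requests UTF-8 for text exchange.
    static String jdbcUrl(String host, String database) {
        return "jdbc:mysql://" + host + "/" + database
                + "?useUnicode=true&characterEncoding=UTF-8";
    }

    // Stores the String as-is; no escape sequences are involved.
    static void save(Connection conn, int id, String text) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO messages (id, body) VALUES (?, ?)")) {
            ps.setInt(1, id);
            ps.setString(2, text);
            ps.executeUpdate();
        }
    }

    // Reads the String back; returns null if the row does not exist.
    static String load(Connection conn, int id) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT body FROM messages WHERE id = ?")) {
            ps.setInt(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString(1) : null;
            }
        }
    }
}
```

The column itself also has to be declared with a Unicode character set on 
the server side for the round trip to preserve Chinese and Thai text.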

Best Regards,

Addison

-- 
Addison P. Phillips
Director, Globalization Architecture
webMethods, Inc.

+1 408.962.5487  mailto:aphillips@webmethods.com
-------------------------------------------
Internationalization is an architecture. It is not a feature.

Chair, W3C I18N WG Web Services Task Force
http://www.w3.org/International/ws



Audrey Ng (by way of Martin Duerst) wrote:
> 
> 
> Hi all,
> 
> this is my very first project dealing with internationalization and I am 
> very confused about all these character sets and encodings. Any help 
> would be most welcome.
> Ok, I need to display Chinese (traditional and simplified) and Thai on a 
> website. I am using Tomcat 4.1 and mySQL 4.0.14. How do I store these 
> Chinese and Thai characters in mySQL?
> Can I store the Unicode escape sequence like \u5000\u5001 directly in 
> mySQL?
> I have tried that, but when I retrieve the data in my servlet and then 
> forward it to a JSP to display the result, the characters are displayed 
> as such \u5000\u5001 and not in Chinese. I have set the content type in 
> my page directive as well as the META content-type to UTF-8 already.
> I have tried using ResourceBundles and the Chinese characters are correctly 
> displayed.
> What is the difference between retrieving the Unicode escape sequence 
> from the properties file and from the database?
> 
> Please help!
> Audrey
> 
> 
Received on Wednesday, 6 August 2003 12:20:58 GMT
