
Re: Unicode Conversions

From: Mark Davis <markdavis@ispchannel.com>
Date: Thu, 07 Sep 2000 06:58:10 -0700
Message-ID: <39B79EF2.35923D71@ispchannel.com>
To: Stephen Toner <Stephen.Toner@virtualaccess.com>
CC: www-international@w3.org, www.unicode.org@ispchannel.com

> Hello all,
> I have been trying to input Unicode from a browser and store it in a database.  The problem is the different encodings used to represent Unicode.
> The input text is in the UTF-8 format.  I have read on the Microsoft support site that SQL Server 7.0 uses a different Unicode encoding (UCS-2) and does not recognize UTF-8
> as valid character data.  Of the solutions offered only two were of any use:
> 1) Convert between the two on input and output
> 2) Store as raw data in binary form
> I have been unable to get the raw data into the database correctly so decided to try the first option.  However, although I keep reading that round-trip conversion between the 2
> formats is quick, easy and reliable, I have been unable to accomplish this.  I am using JSPs, so the Session.Codepage command doesn't work, and anyway I would prefer a
> less platform-specific solution.  Does anyone know of a way of converting a Java string in UTF-8 to UTF-16 format?
I talk about it a bit in an older paper of mine, at

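You can use either the String API or the Stream API. For Strings, use:

// Decode the UTF-8 bytes; the resulting String holds 16-bit code units.
String utf16chars = new String(utf8bytes, "UTF8");

// Encode the String as UTF-16 bytes if raw bytes are needed.
byte[] utf16bytes = utf16chars.getBytes("UTF-16");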

For Streams, wrap the byte stream in an InputStreamReader (to decode)
or an OutputStreamWriter (to encode), passing the encoding name.
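
The stream approach might look roughly like the sketch below. It is only an illustration under stated assumptions: the class and method names (StreamTranscoder, utf8ToUtf16) are hypothetical, it uses in-memory byte arrays in place of real input and output streams, and it relies on java.nio.charset.StandardCharsets, which is newer than the Java releases current when this message was written.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are illustrative only.
class StreamTranscoder {

    // Decodes UTF-8 bytes through a Reader and re-encodes them through a
    // Writer as big-endian UTF-16 without a byte order mark.
    static byte[] utf8ToUtf16(byte[] utf8bytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Reader reader = new InputStreamReader(
                     new ByteArrayInputStream(utf8bytes), StandardCharsets.UTF_8);
             Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_16BE)) {
            char[] buffer = new char[1024];
            int n;
            // Copy decoded characters from the reader to the writer.
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        } catch (IOException e) {
            // Not expected for in-memory streams; rethrown unchecked for brevity.
            throw new UncheckedIOException(e);
        }
        // The writer was flushed and closed above, so all bytes are in 'out'.
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // U+65E5 encoded as UTF-8 is the three bytes E6 97 A5.
        byte[] utf8 = {(byte) 0xE6, (byte) 0x97, (byte) 0xA5};
        System.out.println(utf8ToUtf16(utf8).length); // prints 2
    }
}
```

With real I/O, the ByteArrayInputStream and ByteArrayOutputStream would be replaced by the request's input stream and whatever stream feeds the database; the Reader and Writer wrapping stays the same.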

> Also I was wondering if anyone knows why the UTF-8 can't be treated as a regular Latin1 string.  My database is set to use the Cp1252 code page, and so should this not
Whenever you mark bytes with the wrong codepage, you are likely to get
errors; any software that interprets or converts those bytes will get
the wrong answer. Using Cp1252 when what you are storing is either
UTF-8 or UTF-16 will give you problems.
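
To make the failure concrete, here is a small sketch of what mislabeling does. It is an assumption-laden illustration, not a diagnosis of the setup described above: the class name WrongCodepageDemo is hypothetical, and it uses ISO-8859-1 (Latin-1) because every Java runtime is required to support it. Cp1252 behaves the same way for this purpose, although it maps some bytes in the 0x80 to 0x9F range to different characters.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
class WrongCodepageDemo {

    // Interprets the bytes as Latin-1, whatever encoding they really use.
    static String decodeAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Encodes text as Latin-1; characters outside Latin-1 are replaced by '?'.
    static byte[] encodeAsLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // One Japanese character (U+65E5) is three bytes in UTF-8.
        byte[] utf8 = "\u65E5".getBytes(StandardCharsets.UTF_8);

        // Mislabeled as Latin-1, the three bytes become three separate characters.
        System.out.println(decodeAsLatin1(utf8).length()); // prints 3

        // Converting the real character to Latin-1 loses it entirely.
        byte[] lossy = encodeAsLatin1("\u65E5");
        System.out.println((char) lossy[0]); // prints ?
    }
}
```

The first case leaves the bytes intact but gives every later conversion three wrong characters to work with; the second is lossy, since the original character cannot be recovered from a single '?'.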

> recognise the characters input to it? E.g. a Japanese character in UTF-8 was broken down to ??? and these three characters are in the Windows character set.  However, by
> the time it reaches the database it is changed to ?    Does this mean that somewhere along the way the string is being changed into a different form where the character set
> doesn't support certain characters?   Does the fact that Java internally uses UTF-16 (I think) cause any problems?
Java supports UCS-2, but UTF-16 is simply an extension of UCS-2, and
shares the same storage. The difference is not relevant to you here.
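
The shared storage can be seen in a short sketch: a character that UCS-2 can represent occupies one 16-bit unit, and UTF-16 extends this by using a pair of such units for characters beyond that range. The class name CodeUnitDemo is hypothetical, and the code point methods used here were added to Java in release 5, after this message was written, so treat this as an illustration rather than something available at the time.

```java
// Hypothetical demo class; the names are illustrative only.
class CodeUnitDemo {

    // Number of 16-bit code units the String uses for storage.
    static int codeUnits(String text) {
        return text.length();
    }

    // Number of Unicode characters (code points) the String represents.
    static int codePoints(String text) {
        return text.codePointCount(0, text.length());
    }

    public static void main(String[] args) {
        // U+65E5 fits in one 16-bit unit, as it would in UCS-2.
        String bmp = "\u65E5";
        System.out.println(codeUnits(bmp) + " " + codePoints(bmp)); // prints 1 1

        // U+1F600 lies outside the 16-bit range, so UTF-16 stores it as a surrogate pair.
        String supplementary = new String(Character.toChars(0x1F600));
        System.out.println(codeUnits(supplementary) + " " + codePoints(supplementary)); // prints 2 1
    }
}
```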

> Thanks for any suggestions,
> Stephen
> (If you have just gotten this message already I apologise but I was having difficulty with registration)
Received on Thursday, 7 September 2000 09:56:13 UTC