Database Server Unicode Comparability Problem from Christian Kim on 2000-07-21 (www-international@w3.org from July to September 2000)

From: Christian Kim <ckim@ecal.com>
Date: Fri, 21 Jul 2000 10:23:02 -0400
To: <www-international@w3.org>
Message-ID: <000c01bff31f$2d8d1660$f207a8c0@appoint.lan>

In the past, we were using an Oracle database that supported the WE8ISO8859P1 (a.k.a. Latin-1 or ISO 8859-1 West European) character set. We were also using a client that supported the WE8ISO8859P1 character set. This unfortunately turns out to be an inappropriate character set to support languages in parts of the world that are not Western European. However, due to the way the driver worked, since both the client (IIS Server) and the server (Oracle database) claimed they were using the same character set, no conversions were happening and data was being sent straight through, bit by bit. This allowed us to play games with the garbage-in/garbage-out method of data input and output, and be able to support characters outside the range of the Latin-1 set.

Today, we need to correctly store the data so we can better represent the characters and maintain their integrity as to be able to perform such tasks as linguist sorting as well as be sure the data is not corrupt where ?'s will be returned instead of the appropriate character. To do this we need to move towards a more universal character set.. Unfortunately, the database and the client do not have the ability to choose a matching character set that supports this. The database must use the UTF-8 character set, where the Windows client can only use the UCS-2 character set. Now we have a client and a server with different character sets and this conversion must occur.

Unfortunately our current tool set (a.k.a. driver) does not (properly) support this conversion. However the newer tool set does support this conversion, but in order for this to happen we need to use a slightly different method of accessing the database and some of our stored procedures will need to be rewritten/altered to work properly.

Here is a quick chart showing the situation:

Today (non-international):
Database:
Oracle 8.0.4 - WE8ISO8859P1
Client
Oracle 8.0.5 - WE8ISO8859P1; ADO using MSDAORA (MS OLEDB driver)
Appears to work with any character that can be represented with at most 16 bits. This is because no conversion is happening and we are seeing a garbage-in/garbage-out effect.

I18N:
Database:
Oracle 8.1.6 - UTF-8
Client
Oracle 8.0.5 - UTF-8; ADO using MSDAORA or ORAOLEDB.ORACLE
Not supported!

Database:
Oracle 8.1.6 - UTF-8
Client
Oracle 8.0.5 - WE8ISO8859P1; ADO using MSDAORA
Works only if the input characters are within the Latin-1 character set. The extended windows characters will fail. The extended Windows 1252 code page characters are as follows:

€ ‚ ƒ „ … † ‡ ˆ ‰ Š ‹ Œ Ž ‘ ’ “ ” • – — ˜ ™ š › œ ž Ÿ

Database:
Oracle 8.1.6 - UTF-8
Client
Oracle 8.0.5 - WE8MSWIN1252; ADO using MSDAORA
Not supported!

Database:
Oracle 8.1.6 - UTF-8
Client
Oracle 8.0.5 - WE8ISO8859P1; ADO using ORAOLEDB.ORACLE
Not supported!

Database:
Oracle 8.1.6 - UTF-8
Client
Oracle 8.1.6 - UTF-8; ADO using MSDAORA
Not supported!

Database:
Oracle 8.1.6 - UTF-8
Client
Oracle 8.1.6 - WE8ISO8859P1; ADO using MSDAORA
Works only if the input characters are within the Latin-1 character set. The extended windows characters will fail. The extended Windows 1252 code page characters are as follows: €‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ. However, this combination of the 8.1.6 Oracle driver set and MSDAORA causes a potential crash situation and may be unstable. This is possible because MSDAORA is compiles against the 7.x Oracle driver set.

Database:
Oracle 8.1.6 - UTF-8
Client
Oracle 8.1.6 - WE8MSWIN1252; ADO using MSDAORA
Works for all Windows 1252 code page characters. However, this combination of the 8.1.6 Oracle driver set and MSDAORA causes a potential crash situation and may be unstable. This is possible because MSDAORA is compiles against the 7.x Oracle driver set.

Database:
Oracle 8.1.6 - UTF-8
Client
Oracle 8.1.6 - UTF-8; ADO using ORAOLEDB.ORACLE
Appears to work for all characters in all character sets. However, this driver needs to be tested a bit better and all stored procedures that return record sets will need to be converted to return CURSORs instead of table columns. This, in the long run, will potentially improve performance but will take a bit of work to get to a state where we can use it.

Database:
Oracle 8.1.6 - UTF-8
Client
Oracle 8.1.6 - WE8ISO8859P1; ADO using ORAOLEDB.ORACLE
Irrelevant.

Database:
Oracle 8.1.6 - UTF-8
Client
Oracle 8.1.6 - WE8MSWIN1252; ADO using ORAOLEDB.ORACLE
Irrelevant.

Sorry for the hard to read chart.. but in summary this says that we really have only two choices.

1) Using the current driver and ADO with MSDAORA and only supporting the Latin-1 character set running the risk of anyone typing in any other character an it turning into a "?". No code will need to change in ASP, COM, or CAPI layers.

2) Using the newest driver and ADO with ORAOLEDB.ORACLE and properly supporting all character sets. However some stored procedures will need to be rewritten and some code in the ASP, COM, and CAPI layers will need to be adjusted. Also, work needs to be done to prove this is the absolute correct way to go in terms of performance, character conversions, and stability.

If anyone have any suggestions and solutions, I will be glad to hear from you!
I can use some help here!

Thanks in advance!
Chris

Christian Kim
Program Analyst
e-mail: ckim@ecal.com
http://www.ecal.com
eCal Corporation
Penn Mutual Tower, 2nd Floor
520 Walnut Street
Philadelphia, PA 19106
TEL: 215-627-5001, Ex.3727
Fax: 215-574-9481
"YOUR ULTIMATE INTERNET CALENDAR!"

Received on Friday, 21 July 2000 10:23:38 UTC