Re: Z39.50 character encoding

On Thu, Feb 21, 2002 at 02:07:59PM -0500, Ray Denenberg wrote:
> I posted a message last October about character
> sets...

We support Unicode data, but I must admit we have not done it the
official Z39.50 way - because it was too hard and too global.
I am a little hesitant to admit the slightly mungy way we
do things, but it has worked for a range of projects
over several years now.

Changing the encoding of *all* InternationalStrings based on an
init option was hard to do. Also, we can have a mix of databases
supporting Unicode and simple ASCII in the one server. Rather than
try to normalize everything into one encoding from whatever
encoding the db designer used when putting the data into the
system, we instead return information about the encoding on
a per GRS-1 field basis (and leave everything else as plain ASCII).

For example, when returning SGML or XML, we actually return
additional information about the SGML and XML to help the client
work out how to parse it correctly. A part of this is the encoding
of the text (ASCII, UTF-8, UTF-16, etc.).
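
To illustrate the per-field approach described above, here is a minimal
Python sketch. The TaggedField class, the tag names and the sample values
are hypothetical and are not part of Z39.50 or of any toolkit; the only
point is that each field carries its own encoding label and the client
decodes it accordingly, with plain ASCII as the default.

```python
from dataclasses import dataclass


@dataclass
class TaggedField:
    """One returned field plus the encoding the server says it used.

    Hypothetical structure for illustration only.
    """
    tag: str
    data: bytes
    encoding: str = "ascii"  # anything unlabelled is treated as plain ASCII


def decode_field(field: TaggedField) -> str:
    # The client trusts the per-field label instead of a global setting.
    return field.data.decode(field.encoding)


# A record mixing encodings, as a server holding mixed databases might return.
record = [
    TaggedField("title", "Zürich".encode("utf-8"), "utf-8"),
    TaggedField("isbn", b"0123456789"),
    TaggedField("note", "Ω".encode("utf-16"), "utf-16"),
]

decoded = {f.tag: decode_field(f) for f in record}
```

The server never converts anything here; it only reports what the
database designer chose, which is the trade-off discussed above.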

For query terms that are supplied, we did not mark the terms
(although that sounds like a good idea). Instead, if the USE
attribute was bound to a Unicode field, then we assumed the
term was UTF-8 encoded. If not, we assumed plain ASCII.
Having a per-query-term attribute sounds like a better (safer!)
way to me. (The global flag is also ok here.)
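
The implicit rule for query terms can be sketched roughly as follows in
Python. The set UNICODE_USE_ATTRIBUTES and the function decode_term are
hypothetical names; which Use attributes are bound to Unicode fields is
an assumption made up for this example (4 and 1003 are the Bib-1 Use
values for Title and Author).

```python
# Hypothetical server configuration: Use attribute values that the
# database administrator has bound to Unicode fields.
UNICODE_USE_ATTRIBUTES = {4}  # assume only Title (Bib-1 Use 4) is Unicode


def decode_term(use_attribute: int, term: bytes) -> str:
    """Decode a query term according to the field its Use attribute maps to."""
    if use_attribute in UNICODE_USE_ATTRIBUTES:
        encoding = "utf-8"
    else:
        encoding = "ascii"
    # Raises UnicodeDecodeError if the bytes do not match the assumption.
    return term.decode(encoding)


title_term = decode_term(4, "café".encode("utf-8"))
author_term = decode_term(1003, b"Smith")
```

A non-ASCII term sent against a field assumed to be ASCII fails to
decode, which is why an explicit marker on each term looks safer than
guessing from the Use attribute.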

I guess there is a fundamental question as to whether the
server can return data in the format that it chooses to do so,
or whether it must do what the client commands. In other words,
do clients ask for a preferred encoding, or can clients demand
the return encoding? We do neither at present - we let the server
return the data in whatever encoding it wants to (as defined by
the database administrator who set up the database in the first
place) and tell the client what the encoding was.

On the other hand...

I just asked one of the developers who would have to really do
the work, and he asked 'if it was not a global flag, which strings
would be converted and which would not? Element set names?
Result set names? OCTET STRING of an EXTERNAL? GRS-1 values?
etc etc.' To quote him:

> I would go with a modified version of (a). (b) has the problem that it
> pinpoints only one place where Unicode is an issue. How should scan
> terms be returned? Are database names UTF-8 or not? etc. The simplest
> thing to define in the standard is to have a bit indicate either UTF-8
> or not for all character strings.

So I do not have a definite opinion at present, but would like to
do Unicode in a more interoperable way (but have to work out how
hard it would be to do with our current toolkits etc).

Alan

Received on Monday, 25 February 2002 19:04:02 UTC