- From: Alan Kent <ajk@mds.rmit.edu.au>
- Date: Tue, 26 Feb 2002 11:03:01 +1100
- To: zig <www-zig@w3.org>
On Thu, Feb 21, 2002 at 02:07:59PM -0500, Ray Denenberg wrote:
> I posted a message last October about character
> sets...

We support Unicode data, but I must admit we have not done it the official Z39.50 way - because it was too hard and too global. I am a little hesitant to admit the slightly mungy way we do things, but it has worked for a range of projects over several years now.

Changing the encoding of *all* InternationalStrings based on an init option was hard to do. Also, we can have a mix of databases supporting Unicode and simple ASCII in the one server. Rather than try to normalize everything into one encoding from whatever encoding the db designer used when putting the data into the system, we instead return information about the encoding on a per GRS-1 field basis (and leave everything else as plain ASCII). For example, when returning SGML or XML, we actually return additional information about the SGML and XML to help the client work out how to parse it correctly. A part of this is the encoding of the text (ASCII, UTF-8, UTF-16, etc.).

For query terms that are supplied, we did not mark the terms (although that sounds like a good idea). Instead, if the USE attribute was bound to a Unicode field, then we assumed the term was UTF-8 encoded. If not, we assumed plain ASCII. Having a per query term attribute sounds like a better (safer!) way to me. (The global flag is also ok here.)

I guess there is a fundamental question as to whether the server can return data in the format that it chooses, or whether it must do what the client commands. In other words, do clients ask for a preferred encoding, or can clients demand the return encoding? We do neither at present - we let the server return the data in whatever encoding it wants to (as defined by the database administrator who set up the database in the first place) and tell the client what the encoding was.

On the other hand....
I just asked one of the developers who would have to really do the work, and he asked 'if it was not a global flag, which strings would be converted and which would not? Element set names? Result set names? OCTET STRING of an EXTERNAL? GRS-1 values? etc etc.' To quote him:

> I would go with a modified version of (a). (b) has the problem that it
> pinpoints only one place where unicode is an issue. How should scan
> terms be returned? Are database names utf-8 or not? etc. The simplest
> thing to define in the standard is have a bit indicate either utf-8 or
> not for all character strings.

So I do not have a definite opinion at present, but would like to do Unicode in a more interoperable way (but have to work out how hard it would be to do with our current toolkits etc).

Alan
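
P.S. The query-term rule described above (a term bound to a Unicode field is assumed to be UTF-8, anything else plain ASCII) could be sketched roughly as below. This is only an illustration, not code from any actual server: the function name, the set of Unicode-bound USE attributes, and the attribute numbers chosen are all assumptions made for the example.

```python
# Hypothetical sketch of the "USE attribute decides the encoding" rule.
# Nothing here comes from a real Z39.50 toolkit; all names are illustrative.

# Assumption: the database administrator has bound these USE attributes
# to Unicode fields. The value 4 is only an example.
UNICODE_USE_ATTRIBUTES = {4}


def decode_query_term(raw: bytes, use_attribute: int) -> str:
    """Decode a query term according to the field its USE attribute is bound to."""
    if use_attribute in UNICODE_USE_ATTRIBUTES:
        # Term targets a Unicode field: assume it was sent as UTF-8.
        return raw.decode("utf-8")
    # Otherwise assume plain ASCII; non-ASCII bytes raise UnicodeDecodeError.
    return raw.decode("ascii")
```

Note that the sketch fails loudly when a non-ASCII term arrives for an ASCII field. That is one way to see why an explicit per-term marker would be safer: the encoding would be stated by the client rather than guessed from the field binding.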
Received on Monday, 25 February 2002 19:04:02 UTC