character encoding assumptions and approaches from Ray Denenberg on 2002-03-05 (www-zig@w3.org from March 2002)

From: Ray Denenberg <rden@loc.gov>
Date: Tue, 05 Mar 2002 16:37:40 -0500
To: zig <www-zig@w3.org>
Message-ID: <3C853AA4.828230C5@loc.gov>
I have some ideas on the character set encoding
problem, but before I develop them further, or put
them out for discussion and possibly yet more
digression, I have a few questions:

First, may I infer the following from the
discussion so far:

1. We agree that it's a good idea to add an option
bit allowing negotiation of utf-8, subject to
agreement about the scope of negotiation;
specifically:
2. We want a mechanism to override utf-8, in a
present request, or for a specific record in a
present response; however:
3. We don't need to override utf-8 for a search
term. (Thus we don't need to define a character
set encoding attribute, at least, not for now, and
negotiation of utf-8 will mean that all search
terms are supplied in utf-8.)

If these assumptions are correct then we've
distilled the character encoding problem down to
how to override utf-8.
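
To make assumptions 1-3 concrete, here is a minimal
sketch of how the effective encoding might be
resolved once utf-8 has been negotiated. The
function names, the encoding labels, and the
precedence (record override, then request
override, then utf-8) are all assumptions made for
illustration; none of them is part of Z39.50.

```python
from typing import Optional

# Default once the (proposed) utf-8 option bit has been negotiated.
NEGOTIATED_DEFAULT = "utf-8"


def search_term_encoding() -> str:
    """Assumption 3: search terms are never overridden."""
    return NEGOTIATED_DEFAULT


def record_encoding(request_override: Optional[str] = None,
                    record_override: Optional[str] = None) -> str:
    """Pick the encoding for one record in a present response.

    Hypothetical precedence: a per-record override wins over a
    per-request override, which wins over the negotiated utf-8.
    """
    if record_override is not None:
        return record_override
    if request_override is not None:
        return request_override
    return NEGOTIATED_DEFAULT
```

With no override, record_encoding() falls back to
utf-8; what remains open is how the two override
values would actually be carried in the protocol.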

I further assume:
(4) we agree that the implicit approach won't
work, that is, the native encoding of a format
implicitly overriding utf-8, and that we need an
explicit mechanism.

I don't want to try to solve this by throwing oids
at the problem.  I think it's shortsighted. No,
we're not going to run out of oids. But as Matthew
and others have pointed out, there are a number of
dimensions already -- base syntax, schema,
character encoding -- and don't forget format:
(i.e., bibliographic, authority, holdings, community
information, classification -- see
http://lcweb.loc.gov/z3950/agency/defns/oids.html#format).
It wouldn't take long to have an unmanageable oid
tree.

And furthermore, the abstractions we've developed
for Z39.50 are its strength and we should exploit
them. Perhaps we did a good job of developing
abstractions and not so good a job of engineering
them into the protocol, at least not from a
contemporary perspective. Perhaps it's not
out of the question to consider some
reverse-engineering, rather than throwing out the
model.

Now, the straightforward Z39.50 approach would
use:
(a) compspec, espec, and variant on the request,
and
(b) grs-1 (with embedded variant) on the response.

and the sentiment is that this is overkill for
what we're narrowly focusing on now, which is
simply the ability to specify an encoding for a
marc record.

I think we can come up with a solution for (a),
the request part.  I think (b), the response part,
is harder.

My question, at this point,  is: is it (a) that
people resist, and are we willing to put marc
records in grs-1?  Z39.50 is still an asn.1
protocol, don't forget. So it isn't as though
you're going to avoid asn.1 by sending straight
marc rather than marc wrapped in grs-1.

But assuming you don't want to do grs-1,  is this
a reasonable alternative:  assume we come up with
a solution for the request. The records would be
supplied in the native record syntax (marc21,
ukmarc, etc.) encoded as requested; if the server
cannot supply records in the requested encoding it
fails the request or supplies surrogate
diagnostics.
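
The server behaviour described above could be
sketched roughly as follows. This is an
illustration only: the names present,
encode_record and SurrogateDiagnostic are
hypothetical, records are modelled as plain
strings rather than real MARC, and Python codec
names stand in for the encodings a server would
actually support.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class SurrogateDiagnostic:
    """Stand-in for a per-record diagnostic (hypothetical shape)."""
    position: int
    message: str


def encode_record(text: str, encoding: str) -> Optional[bytes]:
    """Return the record in the requested encoding, or None if impossible."""
    try:
        return text.encode(encoding)
    except (UnicodeEncodeError, LookupError):
        # Unencodable character, or an encoding the server does not know.
        return None


def present(records: List[str], requested_encoding: str,
            fail_whole_request: bool = False
            ) -> List[Union[bytes, SurrogateDiagnostic]]:
    """Supply each record as requested, or fail / substitute a diagnostic."""
    response: List[Union[bytes, SurrogateDiagnostic]] = []
    for position, text in enumerate(records, start=1):
        data = encode_record(text, requested_encoding)
        if data is not None:
            response.append(data)
        elif fail_whole_request:
            raise ValueError(
                f"record {position} unavailable in {requested_encoding}")
        else:
            response.append(SurrogateDiagnostic(
                position, f"cannot supply in {requested_encoding}"))
    return response
```

Whether a server fails the whole request or
substitutes a diagnostic per record is left here
as a flag; which of the two it should do is part
of the question.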

Please give this some thought.

--Ray
Received on Tuesday, 5 March 2002 16:36:28 UTC