RE: character encoding assumptions and approaches

Pieter,
What do you mean when you say "in the Bath Profile"? I would be happy to
incorporate it, but I need somebody to write the section. Are you
volunteering?
Carrol

-----Original Message-----
From: Pieter Van Lierop [mailto:pvanlierop@geac.fr]
Sent: Wednesday, March 06, 2002 4:33 AM
To: 'Ray Denenberg'; zig
Subject: RE: character encoding assumptions and approaches


Sorry, but I do not agree with your assumptions.

1. I still think that the Z39.50 protocol should not concern itself with
the contents of anything that is not defined in the Z39.50 protocol
itself, for example a MARC record. From the point of view of a MARC
record, Z39.50 is only a transport mechanism. The MARC syntaxes have
their own committees, standards, protocols, traditions, and national and
international standards: we should not interfere with that.
However, the solution you propose (an option bit and a diagnostic from
the server) is very simple and easy to ignore (which is good for
compatibility with the current implementations).
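
To show why it is so easy to ignore, here is a small sketch in Python
(the bit position and all names are invented; no option bit has actually
been assigned):

    # Hypothetical: suppose some bit of the Init options BITSTRING
    # were assigned to "UTF-8 negotiation".
    UTF8_OPTION_BIT = 30   # invented position

    def bit_set(options: bytes, bit: int) -> bool:
        """Test one bit of an ASN.1 BITSTRING, packed MSB-first."""
        byte, offset = divmod(bit, 8)
        if byte >= len(options):
            return False   # short BITSTRING: the bit is simply off
        return bool(options[byte] & (0x80 >> offset))

    def negotiated_utf8(client_options: bytes, server_ok: bool) -> bool:
        # An old client never sets the bit and an old server never
        # looks at it; either way we fall back to today's behaviour.
        return server_ok and bit_set(client_options, UTF8_OPTION_BIT)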

2. The character set agreement that we are discussing does not apply
only to the search term, but to all fields defined as
"InternationalString". Is this correct or not?
This means that, amongst others, the following fields are affected:
ImplementationId, ImplementationName, ResultSetName/ResultSetId,
DatabaseName, AdditionalInfo (in a diagnostic), ElementSetName, and
DisplayTerm (in Scan).
Note, however, that the Term (in Search and Scan) is generally
considered to be an OCTET STRING, and I believe that most client
applications send it as one.
Does that mean that when a client application sends Term as an OCTET
STRING, the character set agreement does *not* apply?
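
An encoder would then have to distinguish the two cases, something like
this (purely illustrative; the field list is the one above and the
fallback encoding is invented):

    # Fields typed InternationalString follow the negotiated encoding;
    # a Term sent as OCTET STRING arguably does not.
    INTERNATIONAL_STRING_FIELDS = {
        "implementationId", "implementationName", "resultSetName",
        "databaseName", "additionalInfo", "elementSetName", "displayTerm",
    }

    def encode_field(name: str, value: str, negotiated: str = "utf-8") -> bytes:
        if name in INTERNATIONAL_STRING_FIELDS:
            return value.encode(negotiated)
        # e.g. Term as OCTET STRING: opaque bytes, the agreement is
        # unclear, so send whatever the client has always sent
        return value.encode("latin-1")   # invented fallback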

3. I think that the best place for this would be the Bath Profile,
because its implementors need it and they will probably be the only
ones who are going to implement it.
I would then suggest using a field somewhere in the InitRequest
indicating the name or OID of the character set. When the server cannot
handle it, the server returns a diagnostic and closes the connection.
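
Roughly like this (a Python sketch; the "characterSetId" field and the
response shape are invented, nothing like it exists in the standard
today):

    SUPPORTED = {"utf-8"}   # what this server can handle

    def handle_init(init_request: dict) -> dict:
        wanted = init_request.get("characterSetId")   # name or OID
        if wanted is not None and wanted not in SUPPORTED:
            # refuse: a diagnostic in the InitResponse, then close
            return {"result": "reject",
                    "diagnostic": f"unsupported character set: {wanted}",
                    "close": True}
        return {"result": "accept", "characterSetId": wanted or "default"}
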
But as I said, I can live with Ray's solution.

Pieter van Lierop

> -----Original Message-----
> From: Ray Denenberg [mailto:rden@loc.gov]
> Sent: Tuesday, March 5, 2002 22:38
> To: zig
> Subject: character encoding assumptions and approaches
> 
> 
> I have some ideas on the character set encoding
> problem, but before I develop them further, or put
> them out for discussion and possibly yet more
> digression, I have a few questions:
> 
> First, may I infer the following from the
> discussion so far:
> 
> 1. We agree that it's a good idea to add an option
> bit allowing negotiation of utf-8, subject to
> agreement about the scope of negotiation;
> specifically:
> 2. We want a mechanism to override utf-8, in a
> present request, or for a specific record in a
> present response; however:
> 3. We don't need to override utf-8 for a search
> term. (Thus we don't need to define a character
> set encoding attribute, at least not for now, and
> negotiation of utf-8 will mean that all search
> terms are supplied in utf-8.)
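> 
> (Illustratively, and with invented names, assumptions 1 and 3
> together would reduce the client side to:
> 
>     def encode_search_term(term: str, utf8_negotiated: bool) -> bytes:
>         # No per-term encoding attribute: once utf-8 is negotiated,
>         # every search term is simply utf-8.
>         if utf8_negotiated:
>             return term.encode("utf-8")
>         return term.encode("latin-1")  # or whatever was used before
> 
> with no per-term machinery at all.)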
> 
> If these assumptions are correct then we've
> distilled the character encoding problem down to
> how to override utf-8.
> 
> I further assume:
> (4) we agree that the implicit approach won't
> work, that is, the native encoding of a format
> implicitly overriding utf-8, and that we need an
> explicit mechanism.
> 
> I don't want to try to solve this by throwing oids
> at the problem.  I think it's shortsighted. No,
> we're not going to run out of oids. But as Matthew
> and others have pointed out, there are a number of
> dimensions already -- base syntax, schema,
> character encoding -- and don't forget format:
> (i.e., bibliographic, authority, holdings, community
> information, classification -- see
> http://lcweb.loc.gov/z3950/agency/defns/oids.html#format).
> It wouldn't take long to have an unmanageable oid
> tree.
> 
> And furthermore, the abstractions we've developed
> for Z39.50 are its strength and we should exploit
> them. Perhaps we did a good job of developing
> abstractions and not so good a job of engineering
> them into the protocol, at least not from a
> contemporary perspective. Perhaps it's not
> out of the question to consider some
> reverse-engineering, rather than throwing out the
> model.
> 
> Now, the straightforward Z39.50 approach would
> use:
> (a) compspec, espec, and variant on the request,
> and
> (b) grs-1 (with embedded variant) on the response.
> 
> And the sentiment is that this is overkill for
> what we're narrowly focusing on now, which is
> simply the ability to specify an encoding for a
> marc record.
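> 
> (Illustratively, the request side of (a) might carry something
> shaped like this, where nested dicts stand in for the real ASN.1
> and the variant triple is invented:
> 
>     present_request = {
>         "recordComposition": {            # CompSpec
>             "generic": {                  # Espec-1
>                 "elements": "F",          # full record
>                 "variantRequest": [
>                     # hypothetical (class, type, value) triple
>                     # meaning "deliver in this character encoding"
>                     ("supportedVariant", "characterEncoding", "utf-8"),
>                 ],
>             }
>         }
>     }
> 
> That is a lot of machinery just to say "utf-8".)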
> 
> I think we can come up with a solution for (a),
> the request part.  I think (b), the response part,
> is harder.
> 
> My question, at this point, is: is it (a) that
> people resist, and are we willing to put marc
> records in grs-1?  Z39.50 is still an asn.1
> protocol, don't forget. So it isn't as though
> you're going to avoid asn.1 by sending straight
> marc rather than marc wrapped in grs-1.
> 
> But assuming you don't want to do grs-1, is this
> a reasonable alternative:  assume we come up with
> a solution for the request. The records would be
> supplied in the native record syntax (marc21,
> ukmarc, etc.) encoded as requested; if the server
> cannot supply records in the requested encoding it
> fails the request or supplies surrogate
> diagnostics.
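> 
> (A sketch of that fallback in Python, with invented shapes for the
> records and diagnostics:
> 
>     def present_records(raw_records, source_enc, requested_enc):
>         out = []
>         for raw in raw_records:
>             try:
>                 out.append(("record",
>                             raw.decode(source_enc).encode(requested_enc)))
>             except (UnicodeError, LookupError):
>                 # cannot transcode this one: substitute a surrogate
>                 # diagnostic in its place
>                 out.append(("surrogateDiagnostic",
>                             "cannot supply record in " + requested_enc))
>         return out
> 
> Or the server simply fails the whole request up front.)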
> 
> Please give this some thought.
> 
> --Ray
> 
