RE: character encoding assumptions and approaches from Pieter Van Lierop on 2002-03-06 (www-zig@w3.org from March 2002)

From: Pieter Van Lierop <pvanlierop@geac.fr>
Date: Wed, 6 Mar 2002 14:57:31 +0100
To: "'Lunau Carrol'" <carrol.lunau@nlc-bnc.ca>
Cc: zig <www-zig@w3.org>, "'Ray Denenberg'" <rden@loc.gov>
Message-ID: <00DE8F985709D6119F6B00805F851D8504F779@parisexchange.fr.geac.com>
Carrol,
Yes, I could write a proposal for this. But let us first wait for the
discussion on the list. Ray had a strong opinion about *not* solving this
via a profile but through a general Z39.50 solution.

Pieter

> -----Original Message-----
> From: Lunau Carrol [mailto:carrol.lunau@nlc-bnc.ca]
> Sent: Wednesday, 6 March 2002 14:53
> To: 'Pieter Van Lierop'; 'Ray Denenberg'; zig
> Subject: RE: character encoding assumptions and approaches
> 
> 
> Pieter
> What do you mean when you say in the Bath Profile? I would be happy to
> incorporate it but I need somebody to write the section. Are you
> volunteering? Carrol
> 
> -----Original Message-----
> From: Pieter Van Lierop [mailto:pvanlierop@geac.fr]
> Sent: Wednesday, March 06, 2002 4:33 AM
> To: 'Ray Denenberg'; zig
> Subject: RE: character encoding assumptions and approaches
> 
> 
> Sorry but I do not agree with your assumptions.
> 
> 1. I still think that the Z39.50 protocol should not bother with the
> contents of anything that is not defined in the Z39.50 protocol, for
> example a MARC record. From the point of view of a MARC record, Z39.50
> is only a transport mechanism. The MARC syntaxes have their own
> committees, standards, protocols, traditions, national standards,
> international standards: we should not bother with that.
> However, the solution you propose (an option bit and a diagnostic from
> the server) is very simple and easy to ignore (that is good for
> compatibility with the current implementations).
> 
> 2. The character set agreement that we are discussing does not only
> apply to the search term, but to all fields defined as "International
> String". Is this correct or not?
> This means that, among others, the following fields are to be
> considered: ImplementationId, ImplementationName,
> ResultSetName/ResultSetId, DatabaseName, AdditionalInfo (in a
> diagnostic), ElementSetName, DisplayTerm (in Scan).
> Actually, the Term (in Search and Scan) is generally considered to be
> an OCTET STRING. I believe that most client applications send it as an
> OCTET STRING.
> Does that mean that when the client application sends Term as an OCTET
> STRING, the character set agreement does *not* apply?
> 
> 3. I think that the best solution would be in the Bath profile, because
> they need it and they will probably be the only ones who are going to
> implement it.
> Then, I would suggest using a field somewhere in the InitRequest
> indicating the name or OID of the Character Set protocol. When the
> server cannot handle this, it returns a diagnostic and closes the
> connection.
> But as I said, I can live with Ray's solution.
> 
> Pieter van Lierop
> 
> > -----Original Message-----
> > From: Ray Denenberg [mailto:rden@loc.gov]
> > Sent: Tuesday, 5 March 2002 22:38
> > To: zig
> > Subject: character encoding assumptions and approaches
> > 
> > 
> > I have some ideas on the character set encoding
> > problem, but before I develop them further, or put
> > them out for discussion and possibly yet more
> > digression, I have a few questions:
> > 
> > First, may I infer the following from the
> > discussion so far:
> > 
> > 1. We agree that it's a good idea to add an option
> > bit allowing negotiation of utf-8, subject to
> > agreement about the scope of negotiation;
> > specifically:
> > 2. We want a mechanism to override utf-8, in a
> > present request, or for a specific record in a
> > present response; however:
> > 3. We don't need to override utf-8 for a search
> > term. (Thus we don't need to define a character
> > set encoding attribute, at least, not for now, and
> > negotiation of utf-8 will mean that all search
> > terms are supplied in utf-8.)
> > 
> > If these assumptions are correct then we've
> > distilled the character encoding problem down to
> > how to override utf-8.
> > 
> > I further assume:
> > (4) we agree that the implicit approach won't
> > work, that is, the native encoding of a format
> > implicitly overriding utf-8, and that we need an
> > explicit mechanism.
> > 
> > I don't want to try to solve this by throwing oids
> > at the problem.  I think it's shortsighted. No,
> > we're not going to run out of oids. But as Matthew
> > and others have pointed out, there are a number of
> > dimensions already -- base syntax, schema,
> > character encoding -- and don't forget format
> > (i.e., bibliographic, authority, holdings, community
> > information, classification -- see
> > http://lcweb.loc.gov/z3950/agency/defns/oids.html#format).
> > It wouldn't take long to have an unmanageable oid
> > tree.
> > 
> > And furthermore, the abstractions we've developed
> > for Z39.50 are its strength and we should exploit
> > them. Perhaps we did a good job of developing
> > abstractions and not so good a job of engineering
> > them into the protocol, at least not from a
> > contemporary perspective. Perhaps it's not
> > out of the question to consider some
> > reverse-engineering, rather than throwing out the
> > model.
> > 
> > Now, the straightforward Z39.50 approach would
> > use:
> > (a) compspec, espec, and variant on the request,
> > and
> > (b) grs-1 (with embedded variant) on the response.
> > 
> > and the sentiment is that this is overkill for
> > what we're narrowly focusing on now, which is
> > simply the ability to specify an encoding for a
> > marc record.
> > 
> > I think we can come up with a solution for (a),
> > the request part.  I think (b), the response part,
> > is harder.
> > 
> > My question, at this point, is: is it (a) that
> > people resist, and are we willing to put marc
> > records in grs-1?  Z39.50 is still an asn.1
> > protocol, don't forget. So it isn't as though
> > you're going to avoid asn.1 by sending straight
> > marc rather than marc wrapped in grs-1.
> > 
> > But assuming you don't want to do grs-1, is this
> > a reasonable alternative: assume we come up with
> > a solution for the request. The records would be
> > supplied in the native record syntax (marc21,
> > ukmarc, etc.) encoded as requested; if the server
> > cannot supply records in the requested encoding it
> > fails the request or supplies surrogate
> > diagnostics.
> > 
> > Please give this some thought.
> > 
> > --Ray
> > 
>
Received on Wednesday, 6 March 2002 08:58:50 UTC
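
The alternative Ray describes at the end of his message (records supplied in
the native syntax, encoded as requested, with the server either failing the
request or returning surrogate diagnostics) can be illustrated with a rough
sketch. Everything here is hypothetical: the names Record,
SurrogateDiagnostic, RequestFailed and present_records are invented for
illustration and are not part of Z39.50 or of any toolkit. Records are
modelled as plain strings, which real MARC records are not, and encodings
are named by Python codec names rather than by anything the ZIG has agreed.

```python
from dataclasses import dataclass
from typing import List, Set, Union


class RequestFailed(Exception):
    """Hypothetical: the server cannot honour the requested encoding at all."""


@dataclass
class Record:
    syntax: str    # native record syntax, e.g. "marc21" or "ukmarc"
    encoding: str  # encoding actually used for the payload
    data: bytes


@dataclass
class SurrogateDiagnostic:
    message: str   # stands in for a record that could not be supplied


def present_records(
    texts: List[str], syntax: str, requested: str, supported: Set[str]
) -> List[Union[Record, SurrogateDiagnostic]]:
    # Case 1: the server does not support the encoding: fail the request.
    # (Assumption: `supported` holds valid Python codec names.)
    if requested not in supported:
        raise RequestFailed(f"encoding not supported: {requested}")

    results: List[Union[Record, SurrogateDiagnostic]] = []
    for text in texts:
        try:
            results.append(Record(syntax, requested, text.encode(requested)))
        except UnicodeEncodeError:
            # Case 2: this one record cannot be expressed in the requested
            # encoding, so a surrogate diagnostic takes its place.
            results.append(
                SurrogateDiagnostic(f"record cannot be supplied in {requested}")
            )
    return results
```

Under these assumptions, a request for "ascii" would return a Record for a
plain-ASCII title and a SurrogateDiagnostic for one containing an accented
character, while a request for an encoding the server does not list would
fail as a whole.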