RE: Distributed searches in Z39.50? from Alex Khokhlov on 2004-04-06 (www-zig@w3.org from April 2004)

From: Alex Khokhlov <alex@lib.msu.ru>
Date: Tue, 6 Apr 2004 19:55:41 +0400
To: "'Alan Kent'" <ajk@mds.rmit.edu.au>, "'ZIG'" <www-zig@w3.org>
Message-Id: <0404061944567000@mail.lib.msu.ru>

Hi Alan!

<warning> highly personal opinion </warning>

I don't know whether my information will be of any help to you, but I
would like to briefly describe the project I did for the Moscow State
University Scientific Library last year. It covers many of the questions
mentioned here. The aim of the project was to build a highly scalable
world-wide Z39.50 search engine, and I think we've reached our goal. We have
created a free web-based library portal for searching Z39.50 servers that
can search more than 1500 catalogs in parallel. You can try it online here:
http://www.sigla.ru/?i18n=en.

The main obstacle to a working implementation of a world-wide distributed
search engine, as we saw it from the beginning, was two-fold:
1. Current methods of querying many Z39.50 catalogs are no faster than the
slowest participating server, because the gateway waits for all servers to
answer before presenting any records.
2. There are still major issues with interoperability between different
Z39.50 servers: access points and attribute combinations differ between
implementations and databases. Each catalog requires its own customization
of the gateway to produce good results.

Therefore, we made two main decisions in our gateway design:
1. It should be as asynchronous and parallel as possible. As a result, the
gateway should be able to display results as they come from each catalog and
work with each result set separately.
2. There should be a fully automatic and dynamic engine for query
reconfiguration whenever search semantics problems arise. As a result, we
have implemented a scheme for dynamic query reformulation: unsupported
access points and/or attributes are simply removed from the query. This
produces a less accurate result, but it is still better than a plain
'diag error'. Actual field experience showed that dropping unsupported
pieces of the query is justified: in most situations it is the best that
can be done, and it gives the best result possible for the initial request.
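
The reformulation idea in point 2 can be sketched roughly as follows. This
is only an illustration, not the Sigla code: the Term, BoolOp and
CatalogProfile types and the access point and attribute names are all
hypothetical. It also assumes that a clause with an unsupported access point
is dropped as a whole, which is one possible reading of "removed from the
query".

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Term:
    access_point: str                     # hypothetical name, e.g. "title"
    value: str
    attributes: frozenset = frozenset()   # hypothetical, e.g. "truncation:right"


@dataclass(frozen=True)
class BoolOp:
    op: str                               # "and" / "or"
    left: "Node"
    right: "Node"


Node = Union[Term, BoolOp]


@dataclass(frozen=True)
class CatalogProfile:
    """What one catalog is known to support (assumed to be learned elsewhere)."""
    access_points: frozenset
    attributes: frozenset


def reformulate(node: Node, profile: CatalogProfile) -> Optional[Node]:
    """Return a query the catalog can run, or None if nothing is left."""
    if isinstance(node, Term):
        # Assumption: an unsupported access point drops the whole clause.
        if node.access_point not in profile.access_points:
            return None
        # Unsupported attributes are stripped; the term itself survives.
        kept = node.attributes & profile.attributes
        return Term(node.access_point, node.value, kept)
    left = reformulate(node.left, profile)
    right = reformulate(node.right, profile)
    # If one operand disappeared, the boolean collapses to the other one.
    if left is None:
        return right
    if right is None:
        return left
    return BoolOp(node.op, left, right)
```

With a profile that supports only "title" and right truncation, a query
such as title=physics AND subject=optics is reduced to the title clause
alone, so the catalog returns a broader result set where it would otherwise
return a diagnostic.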

As for the actual implementation, several interesting technical features
were needed to build the final product. The asynchronous behavior was done
in a 'worker thread pool' manner (active worker threads, passive Z39.50
connection objects). The pool is dynamically populated with threads as
needed, but can't grow beyond 1000 threads. Each thread takes a task from a
task queue (one of 'search', 'sort', 'fetch' or 'browse' for one catalog)
and performs the necessary actions with the remote catalogue in a standard
synchronous manner. Therefore there is no need for any additional
development in any of the existing Z39.50 servers. Any results that are
already available are immediately displayed to the user and can be used
without waiting for the whole distributed search to complete.
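
To give a rough idea of this scheme, here is a minimal sketch using Python's
standard thread pool. It is not the gateway's code: search_catalog is a
hypothetical stand-in that only simulates a blocking Z39.50 search with a
delay, and the catalog dictionaries are made up. The executor plays the role
of the task queue plus worker threads, and results are handed to the caller
in the order in which catalogs answer.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_THREADS = 1000  # upper bound on pool size, as described above


def search_catalog(catalog, query):
    """Hypothetical stand-in for one blocking, synchronous Z39.50 search."""
    time.sleep(catalog["delay"])          # simulates network and server time
    if catalog.get("down"):
        raise ConnectionError(f"{catalog['name']} unreachable")
    return f"{catalog['name']}: hits for {query!r}"


def search_all(catalogs, query, search_fn=search_catalog):
    """Yield (name, result, error) per catalog as soon as each one finishes."""
    workers = max(1, min(MAX_THREADS, len(catalogs)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # One task per catalog; each worker runs its task synchronously.
        futures = {pool.submit(search_fn, c, query): c for c in catalogs}
        for fut in as_completed(futures):
            catalog = futures[fut]
            try:
                result, error = fut.result(), None
            except Exception as exc:      # a failing catalog must not block others
                result, error = None, exc
            yield catalog["name"], result, error


if __name__ == "__main__":
    demo = [
        {"name": "slow", "delay": 0.4},
        {"name": "fast", "delay": 0.0},
        {"name": "broken", "delay": 0.2, "down": True},
    ]
    for name, result, error in search_all(demo, "physics"):
        print(name, result if error is None else f"failed: {error}")
```

The fast catalog is reported first and the slow one last, so the user never
waits for the slowest server before seeing the first records.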

This asynchronous and separate processing of each catalogue also has another
nice aspect: there is no need for strict agreements on common decisions
between catalogues. The gateway uses each catalog's native format for
bibliographic data and its own character encoding scheme, and query
reconfiguration works with each catalog specifically.
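
As a small illustration of such per-catalog handling, the sketch below keeps
a settings record for each catalog and decodes raw field bytes with that
catalog's own encoding. The catalog identifiers, the settings table and the
decode_field helper are all hypothetical; they only show the idea, not how
the gateway actually stores its configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogSettings:
    record_syntax: str   # native bibliographic format requested from the server
    encoding: str        # Python codec name for that catalog's character set


# Hypothetical per-catalog table; no common agreement between catalogs needed.
SETTINGS = {
    "catalog-a": CatalogSettings("USMARC", "utf-8"),
    "catalog-b": CatalogSettings("RUSMARC", "cp1251"),
    "catalog-c": CatalogSettings("UNIMARC", "koi8_r"),
}


def decode_field(catalog_id: str, raw: bytes) -> str:
    """Decode one field's bytes using the encoding configured for the catalog."""
    settings = SETTINGS[catalog_id]
    # Undecodable bytes are replaced rather than aborting the whole record.
    return raw.decode(settings.encoding, errors="replace")
```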


So, my opinion about making distributed search in Z39.50 is the following:

1. Keep communication protocols as simple as possible - this will broaden
their usage and will allow more people to participate in building the
infrastructure.

2. Search servers should only be responsible for a set of very simple and
independent small tasks - this will keep everything clear and allow more
optimization techniques to be applied in the future.

3. Any sophistication or complexity of end-user solutions should be
implemented in portals, GUI applications & other similar client stuff.

This is how we can actually build a diverse distributed infrastructure of
information databases. Other methods are too complicated to be accepted by a
wider community. 

Therefore, there is actually no real need for a 'partial' results
implementation, unless you want to use another metasearch engine like
Sigla in your application. But that is another question, and I'm now
investigating possible ways of doing this kind of thing as simply as
possible. I've already got some practical results you can try online - they
are implemented as an extension to the SRU protocol. I can provide you with
more detailed information if you are interested.

Best regards, Alex Khokhlov.
Received on Tuesday, 6 April 2004 11:55:38 UTC