- From: Alex Khokhlov <alex@lib.msu.ru>
- Date: Tue, 6 Apr 2004 19:55:41 +0400
- To: "'Alan Kent'" <ajk@mds.rmit.edu.au>, "'ZIG'" <www-zig@w3.org>
Hi Alan!

<warning> highly personal opinion </warning>

I don't know whether this will be of any help to you, but I would like to briefly describe a project I did last year for the Moscow State University Scientific Library. It touches on many of the questions raised here.

The aim of the project was to build a highly scalable, world-wide Z39.50 search engine, and I think we have reached our goal. We have created a free web-based library portal that can search more than 1500 Z39.50 catalogs in parallel. You can try it online here: http://www.sigla.ru/?i18n=en.

As we saw it from the beginning, a working world-wide distributed search engine faces two main problems:

1. Current methods of querying many Z39.50 catalogs are no faster than the slowest participating server, because the gateway waits for all servers to answer before presenting any records.
2. There are still major interoperability issues between Z39.50 servers: access points and attribute combinations differ between implementations and databases, so each catalog requires its own gateway customization to produce good results.

Therefore, we made two main decisions in our gateway design:

1. It should be as asynchronous and parallel as possible. As a result, the gateway can display results as they arrive from each catalog and work with each result set separately.
2. There should be a fully automatic, dynamic engine for reformulating the query whenever problems with search semantics arise. As a result, we have implemented a scheme for dynamic query reformulation: unsupported access points and/or attributes are simply removed from the query. This produces a less accurate result, but that is still better than a plain 'diag error'.
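The query reformulation idea in the second design decision can be illustrated with a minimal Python sketch. It is not the Sigla implementation: `Term`, `FakeCatalog`, `UnsupportedAccessPoint` and `search_with_reformulation` are hypothetical names invented for this example, and the catalog is simulated rather than reached over Z39.50. The sketch only shows the retry loop: when a catalog rejects an access point, that part of the query is dropped and the search is tried again.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    access_point: str  # e.g. "title" or "author"; illustrative names only
    value: str


class UnsupportedAccessPoint(Exception):
    """Stand-in for a diagnostic saying an access point is not supported."""

    def __init__(self, access_point):
        super().__init__(access_point)
        self.access_point = access_point


class FakeCatalog:
    """Hypothetical catalog that understands only some access points."""

    def __init__(self, name, supported):
        self.name = name
        self.supported = set(supported)

    def search(self, terms):
        for term in terms:
            if term.access_point not in self.supported:
                raise UnsupportedAccessPoint(term.access_point)
        # A real gateway would get a result set back; we echo the query used.
        return [f"{t.access_point}={t.value}" for t in terms]


def search_with_reformulation(catalog, terms):
    """Retry the search, dropping each access point the catalog rejects."""
    remaining = list(terms)
    dropped = []
    while remaining:
        try:
            return catalog.search(remaining), dropped
        except UnsupportedAccessPoint as diag:
            dropped.append(diag.access_point)
            remaining = [t for t in remaining
                         if t.access_point != diag.access_point]
    # Every part of the query was rejected, so there is nothing to search.
    return [], dropped


query = [Term("title", "war and peace"),
         Term("author", "tolstoy"),
         Term("isbn", "12345")]
catalog = FakeCatalog("demo", supported={"title", "author"})
results, dropped = search_with_reformulation(catalog, query)
print(results, dropped)
```

Returning the list of dropped access points alongside the results lets a portal tell the user that the answer from this catalog is less precise than the one requested.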
Actual field experience showed that dropping unsupported pieces of the query is justified: in most situations it is the best that can be done, and it produces the best result possible for the initial request.

As for the actual implementation, several interesting technical features went into the final product. The asynchronous behavior was done in a 'working thread pool' manner (active working threads, passive Z39.50 connection objects). The pool is dynamically populated with threads as needed, but cannot grow beyond 1000 threads. Each thread takes a task from a task queue (one of 'search', 'sort', 'fetch' or 'browse' for one catalog) and performs the necessary actions with the remote catalogue in the standard synchronous manner. Therefore no additional development is needed in any existing Z39.50 server. Any results that are already available are displayed to the user immediately and can be used without waiting for the whole distributed search to complete.

This asynchronous, separate processing of each catalogue has another nice aspect: there is no need for strict agreement on common decisions between catalogues. The gateway uses each catalog's native bibliographic data format and its own character encoding scheme, and query reconfiguration works with each catalog individually.

So, my opinion on distributed search in Z39.50 is the following:

1. Keep communication protocols as simple as possible. This will broaden their usage and allow more people to participate in building the infrastructure.
2. Search servers should only be responsible for a set of very simple, small, independent tasks. This will keep everything clear and allow more optimization techniques to be applied in the future.
3. Any sophistication or complexity in end-user solutions should be implemented in portals, GUI applications and other similar client software.
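The 'working thread pool' described earlier in this message can be sketched in Python roughly as follows. This is an assumption-laden illustration, not the gateway's code: `WorkerPool`, `fake_search` and the `DELAYS` table are hypothetical, the remote catalogs are simulated with `time.sleep`, and the pool grows by one thread per submitted task until it reaches the cap, which is a simplification of "populated as needed".

```python
import queue
import threading
import time

MAX_THREADS = 1000  # upper bound on pool size mentioned in the message


class WorkerPool:
    """Active worker threads pull (operation, catalog) tasks from a queue."""

    def __init__(self, handler, max_threads=MAX_THREADS):
        self.handler = handler
        self.max_threads = max_threads
        self.tasks = queue.Queue()
        self.results = queue.Queue()
        self.threads = []
        self.lock = threading.Lock()

    def submit(self, operation, catalog):
        self.tasks.put((operation, catalog))
        with self.lock:
            # Grow the pool only while it is below the cap.
            if len(self.threads) < self.max_threads:
                worker = threading.Thread(target=self._run, daemon=True)
                self.threads.append(worker)
                worker.start()

    def _run(self):
        while True:
            operation, catalog = self.tasks.get()
            # Ordinary blocking call: one catalog at a time per thread.
            outcome = self.handler(operation, catalog)
            self.results.put((catalog, outcome))


# Simulated response times in seconds for three made-up catalogs.
DELAYS = {"fast": 0.01, "medium": 0.1, "slow": 0.3}


def fake_search(operation, catalog):
    time.sleep(DELAYS[catalog])
    return f"{operation} done"


pool = WorkerPool(fake_search, max_threads=3)
for name in ("slow", "fast", "medium"):
    pool.submit("search", name)

arrival_order = []
for _ in range(3):
    # Each result is usable as soon as it arrives, not when all are done.
    catalog, outcome = pool.results.get()
    arrival_order.append(catalog)
    print(catalog, outcome)
```

Because every worker talks to its catalog synchronously, nothing has to change on the server side; the parallelism lives entirely in the gateway, and the consumer loop sees each catalog's answer as soon as that catalog responds.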
Following these principles is how we can actually build a diverse distributed infrastructure of information databases. Other methods are too complicated to be accepted by a wider community. Therefore, there is no real need for a 'partial' results implementation, unless you want to use another metasearch engine like Sigla in your application. But that is another question, and I am now investigating ways of doing this kind of thing as simply as possible. I have already got some practical results you can try online; they are implemented as an extension to the SRU protocol. I can provide you with more detailed information if you are interested.

Best regards,
Alex Khokhlov.
Received on Tuesday, 6 April 2004 11:55:38 UTC