Re: Distributed searches in Z39.50? from Sebastian Hammer on 2004-04-02 (www-zig@w3.org from April 2004)

From: Sebastian Hammer <quinn@indexdata.dk>
Date: Fri, 02 Apr 2004 10:31:02 +0200
To: ZIG <www-zig@w3.org>
Message-Id: <5.1.1.6.2.20040402101605.03ab9008@www.indexdata.dk>
We've done a couple of projects, as research, that involved just this kind 
of parallel searching, 'hidden' behind a Z39.50 server.

I think the most obvious solution, if you want to stay close to what's in 
the current standard, involves a combination of concurrent operations and 
asynchronous resource control requests coming back from the server. The 
User Information format SearchResult-1 already provides most of the data 
elements you would want to describe partial search results, and if you use 
this together with resource control, it seems clear to me that the intent 
is that the client should be able to issue present requests on the partial 
results, even before the final search response has been received.

One neat aspect of this approach, which makes use exclusively of data 
structures already defined in the standard, is that it degrades pretty 
gracefully for existing clients. I.e., if the client doesn't support resource 
control, it won't get the asynchronous messages -- it will simply wait till 
the last server has responded.
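
To make the flow concrete, here is a rough sketch in Python. It is a toy 
simulation, not Z39.50 code: search_target, distributed_search and the 
on_progress callback are all hypothetical names, with the callback standing 
in for an asynchronous resource control message that carries partial hit 
counts. A client that passes no callback behaves like an existing client 
and only sees the final count.

```python
import asyncio


async def search_target(name, delay, hits):
    """Pretend to search one remote target (hypothetical stand-in)."""
    await asyncio.sleep(delay)
    return name, hits


async def distributed_search(targets, on_progress=None):
    """Fan a query out to all targets and merge the hit counts.

    on_progress plays the role of an asynchronous resource control
    message. A client that does not support it passes None and simply
    waits until the slowest target has responded.
    """
    tasks = [asyncio.create_task(search_target(*t)) for t in targets]
    per_target = {}
    for finished in asyncio.as_completed(tasks):
        name, hits = await finished
        per_target[name] = hits
        if on_progress is not None:
            # Partial result: hits known so far, per target and in total.
            on_progress(dict(per_target), sum(per_target.values()))
    # Final "search response": the complete hit count.
    return sum(per_target.values())


def run_demo():
    # (name, simulated response time in seconds, simulated hit count)
    targets = [("A", 0.12, 10), ("B", 0.04, 5), ("C", 0.08, 7)]
    updates = []
    total_new = asyncio.run(
        distributed_search(targets, lambda partial, total: updates.append(total))
    )
    total_old = asyncio.run(distributed_search(targets))
    return updates, total_new, total_old
```

In run_demo, the client with a callback sees the running total grow as each 
target answers, while the client without one gets the same final total in a 
single response. Issuing present requests against the partial result set is 
left out of the sketch.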

The funny thing is, if you put the pieces together, it sure seems like 
someone in the ZIG community was thinking of this exact type of 
application, but there haven't been a lot of stories of people doing it in 
anger. We prototyped this approach in a research project in collaboration 
with Fretwell-Downing, but other than that, I have not heard of 
interoperable solutions. I think it's (yet) another area where Z39.50-1995 
was really too far ahead of its time for its own good.

One of the appealing aspects of the approach, at the time, was the 
possibility of 'staggering' Z39.50 parallel-search agents on top of 
each other to be able to search, potentially, thousands of targets in a 
bandwidth-efficient way. My sense, however, is that at present, it's not 
the available bandwidth that limits what we can do in terms of parallel 
searching.
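
As a rough illustration of the staggering idea, here is a small Python 
sketch. The Agent class and its fields are hypothetical, and the hit counts 
are made up; the point is only that each agent knows its immediate children 
and nests whatever they report, so no single agent needs to see the whole 
tree.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """One parallel-search agent (hypothetical model, not a Z39.50 API)."""
    name: str
    local_hits: int = 0
    children: list = field(default_factory=list)

    def search(self):
        # Ask each immediate child; a child may itself be an agent that
        # fans out further, but this level does not need to know that.
        child_reports = [child.search() for child in self.children]
        total = self.local_hits + sum(r["hits"] for r in child_reports)
        return {"name": self.name, "hits": total, "children": child_reports}


def build_demo_tree():
    # 'A' is itself an agent sitting on top of P, Q and R.
    a = Agent("A", children=[Agent("P", 3), Agent("Q", 4), Agent("R", 0)])
    b = Agent("B", 10)
    return Agent("all-libraries-around-the-world", children=[a, b])
```

Calling build_demo_tree().search() returns one nested report in which the 
details for P, Q and R sit inside the entry for A, and the top-level agent 
only sends one request per immediate child.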

Still, it's pretty darn fun stuff.

--Sebastian

At 17:31 02-04-2004 +1000, Alan Kent wrote:

>On Fri, Apr 02, 2004 at 01:06:44AM -0500, Kevin Gamiel wrote:
> > >One of the problems in implementing a Z39.50 distributed search server
> > >is a search request has to return the exact number of hits in the
> > >search response packet.  The exact number of records in the final
> >
> > Says who?  Rule number one: server choice.  Rule number two: profiles.
> > Seems to me, the mechanics for doing what you want are all in place.
>
>I am not 100% sure what you mean by the above sorry. Do you mean that
>you think Z39.50 has enough flexibility to do this functionality
>without change? I do not see any logical way to do it with the current
>protocol. Or are you instead saying there are some simple ways of
>defining another EXTERNAL and dropping it into the existing protocol
>(at several possible points) and then writing a profile that describes
>the EXTERNAL and when to use it? (The latter I agree with.)
>
> > We
> > did this with Isite, we had a "search engine" plugin that was really a
> > distributed Z39.50 client.  I remember thinking about how to fold such
> > functionality into the standard model, but never made much progress.  I
> > *think* concurrent operations were invented for just such a case, but I
> > could be wrong.
>
>Concurrent operations allow a client to send multiple requests without
>waiting for responses down one socket. There are other side effects too,
>related to what and when the server is allowed to send things.
>
>I would like to expose the distributed collection as a single database
>name that you search on (to keep clients easy). That is, have 'sub databases'
>or 'composite databases' (terminology from the Explain record describing
>a database). For example, a server might expose a single database name
>'all-libraries-around-the-world'.
>
> > Otherwise, there are (at least) two remaining issues.
> > First, do you want existing clients to work with this model and second,
> > if you don't care about existing clients, what's the best way to expose
> > this type of functionality.
>
>If you are willing to wait for all servers to respond, nothing needs
>to change in the current protocol at all. The question is how to
>incrementally tell the user of the status of the search while it is
>still in progress.
>
>Existing clients by definition cannot support displaying incremental
>progress of searches as there is no agreed to way of doing this in Z39.50.
>So, clients must change. Whatever was done would have to be a profile,
>probably some new ASN.1 EXTERNAL of some kind, and then both a server
>and client would have to be extended to support it.
>
> > Otherwise, at least at a superficial level, I think it involves using
> > existing PDUs plus profiles and we can all think of a thousand ways to
> > negotiate and implement it.
>
>Yes. Several schemes roll off the tongue easily. But I will try to
>avoid boring people with them (yet! ;-). The requirements are much more
>interesting at this stage.
>
> > Would it be useful?  I think the answer is
> > clearly "yes".  In the past, we usually just hang the search until
> > either all results came back or a timeout occurred and we truncated
> > results, etc (yuck).
>
>I agree, yuck. More than yuck actually, as all queries will take as long
>as the slowest server, and in a world-wide search, that can be SLOW!
>This is one of the reasons successful distributed search applications I
>think have always been done by clients.
>
> > But, what would it take to do this correctly?  If you view the world as
> > XML folks tend to, then everything is a tree and sending a query to a
> > node will potentially branch to n nodes, ad nauseam.
>
>I had not thought of having a tree, but rather a flat list of 'database X
>expands to A, B, C, D etc'. Turning it into a tree would not be hard
>I guess as when 'A' replies, it can return details for P, Q, R which
>all get nested in the 'A' details. So an individual server does not
>need to understand the full tree - it just understands its immediate
>children and how to nest responses it gets back. Interesting though.
>
> > It requires dynamic feedback from each
> > node, discovering the topology in realtime, possibly based on the query
> > itself.  Then it becomes an old-school query routing problem, a whois++
> > delegated query problem, which becomes a management problem, etc, etc.
>
>I was not thinking of dynamically changing what to search. I agree that's
>a hard problem - but I think it's orthogonal to the problem I am
>describing of incremental search progress notification. What the server
>chooses to search is up to it.
>
>
>Hmmm. Resource reports, resource controls, extended services, other-info:
>I agree there is ample scope to add in something using an EXTERNAL.
>I think the next question then is whether it matters if the client has
>to poll for updates or whether the server should squirt async messages
>at the client when it discovers something new to tell the client.
>The answer to this question will determine which existing Z39.50 protocol
>features could be used to support a profile. I do think it's important
>to be able to do present requests before the search has finished running.
>I think async messages sound good, but I am not convinced it's actually
>important - what is wrong with a client polling the server once per second?
>(Hmmm. Unless you have 1000 users of course!)
>
>Is this what you meant by concurrent operations perhaps? Client sends
>a Search request. Whenever the server thinks it appropriate, it sends
>a ResourceControl to the client with updated information. Eventually
>the server sends a Search response when all distributed searches have
>completed. The client however needs to be allowed to do a present
>request against the result set name before the search response comes
>back, which requires support for concurrent operations.
>
>Without concurrent operations, I suspect the client would have to poll
>for updates on search progress no matter what Z39.50 features were
>used. This seems a little CPU wasteful, but probably easier to
>implement in practice. I am thinking of things like ZOOM APIs. Having
>strict client-request/server-response pairs makes general purpose APIs
>much easier to implement. It also avoids mandating 'event model' style
>programming.
>
>Alan

--
Sebastian Hammer, Index Data <http://www.indexdata.dk/>
Ph: +45 3341 0100, Fax: +45 3341 0101
Received on Friday, 2 April 2004 03:33:51 UTC