
Re: Distributed searches in Z39.50?

From: Alan Kent <ajk@mds.rmit.edu.au>
Date: Fri, 2 Apr 2004 17:31:36 +1000
To: Kevin Gamiel <kgamiel@cnidr.org>
Cc: ZIG <www-zig@w3.org>
Message-ID: <20040402073136.GP24242@io.mds.rmit.edu.au>

On Fri, Apr 02, 2004 at 01:06:44AM -0500, Kevin Gamiel wrote:
> >One of the problems in implementing a Z39.50 distributed search server
> >is a search request has to return the exact number of hits in the
> >search response packet.  The exact number of records in the final
> 
> Says who?  Rule number one: server choice.  Rule number two: profiles. 
> Seems to me, the mechanics for doing what you want are all in place.

I am not 100% sure what you mean by the above, sorry. Do you mean that
you think Z39.50 has enough flexibility to provide this functionality
without change? I do not see any logical way to do it with the current
protocol. Or are you instead saying there are some simple ways of
defining another EXTERNAL, dropping it into the existing protocol
(at one of several possible points), and then writing a profile that
describes the EXTERNAL and when to use it? (The latter I agree with.)

> We 
> did this with Isite, we had a "search engine" plugin that was really a 
> distributed Z39.50 client.  I remember thinking about how to fold such 
> functionality into the standard model, but never made much progress.  I 
> *think* concurrent operations were invented for just such a case, but I 
> could be wrong.

Concurrent operations allow a client to send multiple requests without
waiting for responses down one socket. There are other side effects too,
related to what and when the server is allowed to send things.

I would like to expose the distributed collection as a single database
name that you search on (to keep clients simple). That is, have 'sub
databases' or 'composite databases' (terminology from the Explain record
describing a database). For example, a server might expose a single
database name 'all-libraries-around-the-world'.
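To make the idea concrete, here is a minimal sketch (in Python, with
entirely invented database and host names) of how a server might map one
composite database name onto the sub-databases it fans the search out to:

```python
# Hypothetical server-side mapping from a composite database name to
# the sub-databases it expands to. All names here are invented.
COMPOSITE_DATABASES = {
    "all-libraries-around-the-world": [
        "library-a.example.org/books",
        "library-b.example.org/books",
        "library-c.example.org/journals",
    ],
}

def expand_database(name):
    """Return the list of sub-databases to search for 'name', or the
    name itself (as a one-element list) if it is not composite."""
    return COMPOSITE_DATABASES.get(name, [name])
```

A client that searches 'all-libraries-around-the-world' never needs to
know the expansion; a client that searches an ordinary database name is
unaffected.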

> Otherwise, there are (at least) two remaining issues. 
> First, do you want existing clients to work with this model and second, 
> if you don't care about existing clients, what's the best way to expose 
> this type of functionality.

If you are willing to wait for all servers to respond, nothing needs
to change in the current protocol at all. The question is how to
incrementally report the status of the search to the user while it is
still in progress.

Existing clients by definition cannot support displaying incremental
progress of searches, as there is no agreed-upon way of doing this in
Z39.50. So, clients must change. Whatever was done would have to be a
profile, probably involving some new ASN.1 EXTERNAL, and then both a
server and a client would have to be extended to support it.

> Otherwise, at least at a superficial level, I think it involves using 
> existing PDUs plus profiles and we can all think of a thousand ways to 
> negotiate and implement it.

Yes. Several schemes roll off the tongue easily. But I will try to
avoid boring people with them (yet! ;-). The requirements are much more
interesting at this stage.

> Would it be useful?  I think the answer is 
> clearly "yes".  In the past, we usually just hang the search until 
> either all results came back or a timeout occurred and we truncated 
> results, etc (yuck).

I agree, yuck. More than yuck actually, as every query will take as long
as the slowest server, and in a world-wide search, that can be SLOW!
This, I think, is one of the reasons successful distributed search
applications have always been implemented in clients.

> But, what would it take to do this correctly?  If you view the world as 
> XML folks tend to, then everything is a tree and sending a query to a 
> node will potentially branch to n nodes, ad nauseam.

I had not thought of having a tree, but rather a flat list of 'database X
expands to A, B, C, D, etc'. Turning it into a tree would not be hard,
I guess: when 'A' replies, it can return details for P, Q, R, which
all get nested under the 'A' entry. So an individual server does not
need to understand the full tree - it just understands its immediate
children and how to nest the responses it gets back. Interesting though.
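A rough sketch of that nesting idea, with invented names, assuming each
server only ever manipulates the entries for its immediate children:

```python
# Toy tree of database expansions. Each node records a database name
# and the children it is known to expand to; names are illustrative.
def build_node(name, children=None):
    return {"database": name, "children": children or []}

def nest_reply(tree, child_name, grandchildren):
    """When child 'child_name' replies with its own expansion, nest
    those sub-databases under the child's entry in the tree."""
    for child in tree["children"]:
        if child["database"] == child_name:
            child["children"] = [build_node(g) for g in grandchildren]
    return tree

# 'X' expands to A, B, C, D; then A replies that it expands to P, Q, R.
tree = build_node("X", [build_node(n) for n in ["A", "B", "C", "D"]])
nest_reply(tree, "A", ["P", "Q", "R"])
```

The point is that 'X' never needed to know about P, Q, R in advance; the
tree grows as replies come back.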

> It requires dynamic feedback from each 
> node, discovering the topology in realtime, possibly based on the query 
> itself.  Then it becomes an old-school query routing problem, a whois++ 
> delegated query problem, which becomes a management problem, etc, etc.

I was not thinking of dynamically changing what to search. I agree that's
a hard problem - but I think it's orthogonal to the problem I am
describing of incremental search progress notification. What the server
chooses to search is up to it.


Hmmm. Resource reports, resource controls, extended services, other-info:
I agree there is ample scope to add in something using an EXTERNAL.
I think the next question then is whether it matters if the client has
to poll for updates, or whether the server should squirt async messages
at the client when it discovers something new to report.
The answer to this question will determine which existing Z39.50 protocol
features could be used to support a profile. I do think it's important
to be able to do present requests before the search has finished running.
I think async messages sound good, but I am not convinced they are actually
important - what is wrong with a client polling the server once per second?
(Hmmm. Unless you have 1000 users of course!)
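For the polling option, the client-side loop could be as simple as the
following sketch; `get_status` is a hypothetical callable standing in
for whatever Z39.50 facility (resource report, extended service, etc.)
a profile settled on, and the status fields are invented:

```python
import time

def poll_search_status(get_status, interval=1.0, timeout=30.0):
    """Poll get_status() once per 'interval' seconds until it reports
    the distributed search is complete (or 'timeout' elapses), and
    return the partial hit counts observed along the way."""
    history = []
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = get_status()          # e.g. one Z39.50 round trip
        history.append(status["hits_so_far"])
        if status["complete"]:
            break
        time.sleep(interval)
    return history
```

The one-request-per-second cost is trivial for one client; it is the
1000-user case where per-poll round trips start to add up on the server.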

Is this what you meant by concurrent operations perhaps? Client sends
a Search request. Whenever the server thinks it appropriate, it sends
a ResourceControl to the client with updated information. Eventually
the server sends a Search response when all distributed searches have
completed. The client, however, needs to be allowed to do a present
request against the result set name before the search response comes
back, which requires support for concurrent operations.
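To illustrate that flow, here is a toy simulation using an in-memory
queue in place of the Z39.50 association; the PDU names follow the
standard, but the payload fields and hit counts are invented:

```python
import queue

def run_session():
    """Simulate the client side of a search under concurrent
    operations: partial ResourceControl reports arrive before the
    final SearchResponse."""
    wire = queue.Queue()
    # What the (simulated) server sends down the association:
    wire.put(("ResourceControl", {"servers_done": 3, "hits_so_far": 120}))
    wire.put(("ResourceControl", {"servers_done": 7, "hits_so_far": 410}))
    wire.put(("SearchResponse", {"hits": 532}))

    log = []
    while True:
        pdu, payload = wire.get()
        if pdu == "ResourceControl":
            # With concurrent operations, the client could issue a
            # Present against the result set name at this point,
            # before the SearchResponse has arrived.
            log.append(("partial", payload["hits_so_far"]))
        elif pdu == "SearchResponse":
            log.append(("final", payload["hits"]))
            return log
```

Without concurrent operations the ResourceControl traffic would instead
have to be pulled by the client, which is the polling case above.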

Without concurrent operations, I suspect the client would have to poll
for updates on search progress no matter which Z39.50 features were
used. This seems a little CPU wasteful, but probably easier to
implement in practice. I am thinking of things like ZOOM APIs. Having
strict client-request/server-response pairs makes general purpose APIs
much easier to implement. It also avoids mandating 'event model' style
programming.

Alan
Received on Friday, 2 April 2004 02:33:22 GMT
