Arbiter must merge the metadata of scope list elements

Heretofore, this hasn't been as explicit as it needs to be.
This is an attempt to clarify the situation.


LIST OF SCOPE ELEMENTS

Currently, in a DASL query, there is a list of scope elements.
Each scope element consists of a DAV:href element and an optional
DAV:depth element. A scope list element can, for example,
reference a collection that is a whole document 
space under the control of a DMS, or it can reference a 
collection that is a particular (sub)folder in a 
document space, etc.

There is currently no constraint that the scope list elements
all be subcollections of the same root collection or document 
space, or even that the collections involved have to be on the
same server. Said another way, the search arbiter is currently 
free to forward the query to a list of heterogeneous 
document management systems on the same or separate servers
and merge the results.


COLLECTION

A (root) collection is an encapsulation of (1) resources, 
(2) metadata describing those resources, and (3) a query 
engine that both understands the metadata and has direct 
access to the resources. A (root) collection can have 
subcollections, which can have subcollections, and so on.
Subcollections are obviously encapsulated by their 
root collection.


SEARCH ARBITER

The search arbiter performs the query by sending it to
all the scope elements, merges the query results returned
by each scope element, and returns the merged query results 
to the client.

If ordering is specified and is supported, then either (1) each
scope element must return ordered results to the arbiter, or 
(2) the search arbiter does all the sorting. In case (1), 
the arbiter does no sorting. It simply merges the sorted results 
from each scope element. In case (2), the arbiter does the 
sorting. In either case, to perform the merge or sort, the 
search arbiter software must understand the datatypes of
the properties of each scope element, since the sort is done
on an element basis. 

(*1: DASL is currently quiet on whether case (1) or 
case (2) pertains. DASL must become explicit on this issue.)
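
The difference between case (1) and case (2) can be sketched in a few
lines of Python. This is only an illustration under stated assumptions:
the result rows, the "href" key, and the function names merge_presorted
and sort_all are hypothetical and are not part of DASL. In both cases
the arbiter needs a comparison key for the sort property, which is why
it must understand the property's datatype.

```python
import heapq
from operator import itemgetter

# Hypothetical result rows from two scope elements.
# Each list is already sorted by "Author", as case (1) requires.
results_a = [{"href": "/a/1", "Author": "Adams"},
             {"href": "/a/2", "Author": "Miller"}]
results_b = [{"href": "/b/1", "Author": "Baker"},
             {"href": "/b/2", "Author": "Young"}]


def merge_presorted(result_lists, key):
    """Case (1): scope elements return sorted results; the arbiter only merges."""
    return list(heapq.merge(*result_lists, key=key))


def sort_all(result_lists, key):
    """Case (2): the arbiter concatenates all results and sorts them itself."""
    combined = [row for rows in result_lists for row in rows]
    return sorted(combined, key=key)


by_author = itemgetter("Author")
merged = merge_presorted([results_a, results_b], by_author)
resorted = sort_all([results_b, results_a], by_author)
```

Both calls yield the same ordering here. The merge in case (1) is
cheaper for the arbiter, but it is only correct if every scope element
sorts with the same comparison rules the arbiter uses.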

Implicit in all the above behavior is that the search arbiter
software, when asked, (1) retrieves and saves the metadata of 
each scope list element, (2) merges the metadata of all the 
scope elements, and (3) returns the merged metadata to the 
client. 

The merged metadata is returned to the client when the
client software requests the query capabilities of the 
particular scope list submitted to the arbiter software.

(*2: The exact details of how the client retrieves the metadata
is work in progress. The search arbiter will use this same
method to get the metadata of each individual scope list 
element.)


METADATA

The metadata of a particular query consists of (1) the 
properties, (2) the query operators, and (3) the query grammars 
supported by the scope list elements.

The metadata of each scope list element can be exactly the same, 
mostly similar to, or mostly different from the metadata of 
other scope list elements. The most useful case is where there 
is a fair amount of overlap in the metadata of the scope
list elements, but the metadata is not identical. For example,
scope list element "A" may have the "Author" and "loan_number"
properties, and scope list element "B" may have the "Author"
and "purchase_order_number" properties. In this example,
both scope list elements have some common properties
and some different properties. Scope list element "A" supports 
the "contains" operator, and scope list element "B" does
not support the "contains" operator. So, the scope list elements
of this example have query operators in common, and query 
operators that are different.
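
One possible way to picture the example is as three sets per scope
list element. The sketch below is an assumption-laden illustration:
the ScopeMetadata class, the "eq" operator, and the grammar name
"basic" are hypothetical stand-ins (the text names only the properties
and the "contains" operator).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopeMetadata:
    """Hypothetical container for the metadata of one scope list element."""
    properties: frozenset
    operators: frozenset
    grammars: frozenset


meta_a = ScopeMetadata(
    properties=frozenset({"Author", "loan_number"}),
    operators=frozenset({"eq", "contains"}),   # "eq" is an assumed common operator
    grammars=frozenset({"basic"}),             # grammar name is an assumption
)
meta_b = ScopeMetadata(
    properties=frozenset({"Author", "purchase_order_number"}),
    operators=frozenset({"eq"}),               # "B" does not support "contains"
    grammars=frozenset({"basic"}),
)

# Overlap and difference between the two elements.
common_properties = meta_a.properties & meta_b.properties
only_in_a = meta_a.operators - meta_b.operators
```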


MERGING METADATA

There are two reasonable approaches to merging the metadata
of the scope list elements: (1) take the set intersection
of the query capabilities of each scope list element, and 
(2) take the set union of the query capabilities of each scope
list element. 

INTERSECTION: Under intersection rules, the merged metadata
consists only of (1) the properties that are common to all
scope list elements, (2) the query operators that are common
to all scope list elements, and (3) the query grammars
that are common to all scope list elements. This is the easy
case for the search arbiter software, since client queries 
can be forwarded to each scope list element unmodified.
A problem with this case is that, in general, one can't
rely on the intersection being as large as one would like,
unless the situation has been prearranged to make all the
schemas identical. (This eliminates querying across
heterogeneous legacy repositories, which, almost by definition,
had their schemas defined independently of each other.)

UNION: Under union rules, the merged metadata consists of
(1) the set union of the properties of all the scope list
elements, (2) the set union of the query operators of all
the scope list elements, and (3) the set intersection of
the query grammars supported by the scope list elements.
(The intersection of the grammars must be taken even under
union merge, since the search arbiter software must send 
the query to all the scope list elements, so all the scope 
list elements must understand the grammar being used.) 
The union case probably has broader applicability in the 
real world than the intersection case.
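
The two merge rules can be written down as set operations. The sketch
below assumes each scope list element's metadata is a dictionary of
three sets; the function name merge_metadata, the dictionary keys, and
the operator and grammar names "eq", "basic", and "extended" are
hypothetical. Note that the grammars are intersected in both modes,
as described above.

```python
def merge_metadata(elements, mode="union"):
    """Merge per-element metadata under union or intersection rules.

    `elements` is a non-empty list of dicts with the keys "properties",
    "operators", and "grammars", each mapping to a set of names.
    """
    if not elements:
        raise ValueError("scope list must contain at least one element")
    if mode not in ("union", "intersection"):
        raise ValueError("mode must be 'union' or 'intersection'")

    def intersect(field):
        return set.intersection(*(set(e[field]) for e in elements))

    def unite(field):
        return set.union(*(set(e[field]) for e in elements))

    combine = unite if mode == "union" else intersect
    return {
        "properties": combine("properties"),
        "operators": combine("operators"),
        # Every element must understand the grammar, so always intersect.
        "grammars": intersect("grammars"),
    }


element_a = {"properties": {"Author", "loan_number"},
             "operators": {"eq", "contains"},
             "grammars": {"basic", "extended"}}
element_b = {"properties": {"Author", "purchase_order_number"},
             "operators": {"eq"},
             "grammars": {"basic"}}

union_view = merge_metadata([element_a, element_b], mode="union")
intersection_view = merge_metadata([element_a, element_b], mode="intersection")
```

Under union rules the client sees all three properties and both
operators; under intersection rules it sees only "Author" and "eq".
The grammar set is {"basic"} either way.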

Since there is only one glob of metadata returned
to the client for any particular scope list, the client's 
view of the world is the same whether there is one or 
multiple scope list elements in the query.

In the intersection case, all client queries are fully
understood by all scope list elements. In the union case, 
in general, client queries are only partially defined for 
each scope list element. The problems this causes are easily
solved by the search arbiter's use of three valued elimination. 
(As discussed before, three valued elimination is an obvious, 
straightforward extension of ANSI standard SQL three valued 
logic.) Sensible query results are returned for all queries 
that are valid with respect to the metadata returned to the 
client.
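
This message does not define three valued elimination itself, so the
sketch below shows only the SQL-style baseline it is said to extend,
under two loudly stated assumptions: (a) a condition on a property
that a resource does not have evaluates to UNKNOWN, and (b) only
resources for which the whole query evaluates to TRUE are returned,
as with a SQL WHERE clause. All function names and sample resources
are hypothetical; UNKNOWN is represented by None.

```python
def and3(a, b):
    """Three valued AND: FALSE dominates, then UNKNOWN (None)."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def or3(a, b):
    """Three valued OR: TRUE dominates, then UNKNOWN (None)."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def eq(prop, value):
    """Condition that is UNKNOWN when the resource lacks the property."""
    def test(resource):
        if prop not in resource:
            return None
        return resource[prop] == value
    return test


def both(p, q):
    return lambda resource: and3(p(resource), q(resource))


def either(p, q):
    return lambda resource: or3(p(resource), q(resource))


def select(resources, condition):
    """Keep only resources for which the condition is definitely TRUE."""
    return [r for r in resources if condition(r) is True]


# One resource from element "A" and one from element "B" (hypothetical data).
resources = [
    {"href": "/a/1", "Author": "Smith", "loan_number": 17},
    {"href": "/b/1", "Author": "Smith", "purchase_order_number": 99},
]

and_hits = select(resources, both(eq("Author", "Smith"), eq("loan_number", 17)))
or_hits = select(resources, either(eq("Author", "Smith"), eq("loan_number", 17)))
```

The AND query returns only the resource from "A", because the
"loan_number" condition is UNKNOWN for the resource from "B". The OR
query returns both, because TRUE OR UNKNOWN is TRUE.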

(*3: DASL is currently quiet on whether union or intersection
rules are used when merging metadata. DASL must be explicit
about this.)


ALTERNATIVES

The current situation requires the search arbiter to be
fully general: It must collect and merge metadata, perform
three valued elimination, merge query results from multiple 
scope list elements, etc.

Alternatives to the current situation are: 

(1) constrain the scope list to be a single element, or 

(2) constrain the scope list elements to be subcollections 
of the same collection, so that they all have exactly the 
same metadata. 

Then the search arbiter disappears. Said another way, the 
root collection becomes the arbiter for itself and its
subcollections. In either case, there is only one 
metadata description for all the scope list 
elements of a query, so there is no concept of merging the 
metadata of the scope list elements, no three valued 
elimination, no forwarding of the query to multiple 
different servers, etc.

If alternative (1) is chosen, it would nonetheless be possible 
for "anyone" to write a GENERIC SEARCH ARBITER (GSA) that could
interface to any list of DASL collections. The GSA would enhance
the 1.0 DASL protocol to take a LIST of scope elements 
(instead of a single scope list element), collect the metadata 
from each scope list element, merge the metadata, return the 
merged metadata to the client, distribute queries to each 
scope list element, merge query results from all scope list 
elements, and return a single set of query results to the 
client. In fact, doing that wouldn't be difficult, because 
the generic arbiter would use the 1.0 DASL protocol unmodified 
to talk to the clients and the individual scope list elements.
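
The key design point is that the generic arbiter presents the same
interface to clients that each collection presents to the arbiter.
The sketch below illustrates that idea only; the Collection and
GenericSearchArbiter classes and their metadata() and search() methods
are hypothetical stand-ins, not DASL protocol operations, and the
metadata is reduced to a set of property names with union merge
assumed.

```python
class Collection:
    """Hypothetical stand-in for a single DASL collection."""

    def __init__(self, properties, resources):
        self.properties = set(properties)
        self.resources = list(resources)

    def metadata(self):
        return set(self.properties)

    def search(self, condition):
        return [r for r in self.resources if condition(r)]


class GenericSearchArbiter:
    """Offers the same two calls as Collection, over a list of scope elements."""

    def __init__(self, scope):
        self.scope = list(scope)

    def metadata(self):
        # Union merge of the metadata of every scope element.
        merged = set()
        for element in self.scope:
            merged |= element.metadata()
        return merged

    def search(self, condition):
        # Distribute the query, then combine the results into one list.
        results = []
        for element in self.scope:
            results.extend(element.search(condition))
        return results


loans = Collection({"Author", "loan_number"},
                   [{"Author": "Smith", "loan_number": 17}])
orders = Collection({"Author", "purchase_order_number"},
                    [{"Author": "Smith", "purchase_order_number": 99},
                     {"Author": "Jones", "purchase_order_number": 100}])

arbiter = GenericSearchArbiter([loans, orders])
hits = arbiter.search(lambda r: r.get("Author") == "Smith")
```

Because the arbiter and the collections share one interface, a client
cannot tell whether it is talking to a single collection or to an
arbiter, and an arbiter could itself appear as a scope element of
another arbiter.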

No special enhancements to the DASL 1.0 protocol
would be necessary to enable the implementation of such a 
generic search arbiter. (I am assuming the generic arbiter 
would always use union merge. If intersection merge were 
deemed desirable as well, the generic arbiter would enhance the 
1.0 protocol one step further to allow the client to specify 
intersection versus union merge when retrieving the metadata.)


ISSUES (ACTION ITEMS)

Not counting selecting an alternative as an issue, there 
are three DASL issues, flagged as *1, *2, *3, in the above.


                           ---

Received on Tuesday, 30 June 1998 18:16:17 UTC