Proposal: Searching XML from Sebastian Hammer on 2002-04-20 (www-zig@w3.org from April 2002)

From: Sebastian Hammer <quinn@indexdata.dk>
Date: Sat, 20 Apr 2002 23:45:39 +0200
To: www-zig@w3.org
Message-Id: <4.2.0.58.20020420225842.01d12c58@bagel.indexdata.dk>

Hi,

This may have been discussed both in the plenary and in various subgroups, 
but if it has been proposed formally, I have missed it (if so, apologies in 
advance). Anyway...

I would like to propose that the ZIG decide upon a convention for 
modelling the potential set of searchable access points (within an 
application domain or profile) using XPATH Path Expressions. An example 
could be a domain-specific attribute set which defines searchable access 
points as [a subset of] all possible Path expressions that identify data 
elements within a given schema.

EXAMPLE:

Given a database record like this:

<book>
	<title>The Catcher in the Rye</title>
	<author>J.D. Salinger</author>
	<subject vocabulary="LCSH">Fiction</subject>
</book>

One might like to pose a query like:

Find the word "catcher" in the field matching "title".

or:

Find the word "fiction" in fields matching "subject[@vocabulary='LCSH']"
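
As a rough illustration of the two queries above, here is a minimal Python sketch that evaluates each Path Expression against the example record and then looks for the word in the selected elements. It uses only the standard library; xml.etree.ElementTree supports just a limited subset of XPath (which happens to cover both expressions here), and the paths are evaluated relative to the <book> element. The function name matches() and the case-insensitive word matching are assumptions made for the example, not part of the proposal.

```python
import re
import xml.etree.ElementTree as ET

RECORD = """<book>
  <title>The Catcher in the Rye</title>
  <author>J.D. Salinger</author>
  <subject vocabulary="LCSH">Fiction</subject>
</book>"""


def matches(record_xml, path, word):
    """Return True if `word` occurs in any element selected by `path`.

    `path` is evaluated relative to the record's root element, and the
    word comparison is case-insensitive (an assumption of this sketch).
    """
    root = ET.fromstring(record_xml)
    needle = word.lower()
    for element in root.findall(path):
        # Collect all text below the element and split it into words.
        text = "".join(element.itertext())
        words = re.findall(r"\w+", text.lower())
        if needle in words:
            return True
    return False


if __name__ == "__main__":
    print(matches(RECORD, "title", "catcher"))
    print(matches(RECORD, "subject[@vocabulary='LCSH']", "fiction"))
```

A real server would of course resolve the access point against its indexes rather than scanning records, but the selection semantics would be the same.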

Technically, the Path Expression would have to go into the "Complex" branch 
of the attributeValue, as a single string value, thus requiring support for 
version 3 of the protocol. I suggest that a similar mechanism be defined 
for SRW/U if it is not already in place. But do note that in the first 
round, there's no requirement to go beyond the current definition of the 
Type-1 query.
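
To make the placement concrete, the following Python sketch models a Type-1 term operand whose Use attribute carries the Path Expression as a single string in the "Complex" branch. The class and field names only loosely mirror the ASN.1 definition and are assumptions of this sketch, as is the helper needs_version_3(); nothing here is a wire encoding.

```python
from dataclasses import dataclass, field
from typing import List, Union

USE = 1  # attribute type 1 is "Use" in Bib-1-style attribute sets


@dataclass
class ComplexValue:
    """The "Complex" branch of attributeValue: a list of strings or numbers."""
    items: List[Union[str, int]]
    semantic_action: List[int] = field(default_factory=list)


@dataclass
class AttributeElement:
    attribute_type: int
    value: Union[int, ComplexValue]  # numeric branch or complex branch


@dataclass
class TermOperand:
    attributes: List[AttributeElement]
    term: str


def xpath_access_point(path):
    """Build a Use attribute holding the Path Expression as one string."""
    return AttributeElement(USE, ComplexValue([path]))


def needs_version_3(operand):
    """Complex attribute values imply version 3 of the protocol."""
    return any(isinstance(a.value, ComplexValue) for a in operand.attributes)


if __name__ == "__main__":
    operand = TermOperand([xpath_access_point("subject[@vocabulary='LCSH']")],
                          "fiction")
    print(operand)
    print(needs_version_3(operand))
```

The query structure itself is unchanged; only the attribute value moves from a number to a string.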

RATIONALE:

Outside of the library domain, Z39.50 is sometimes employed to support the 
networked IR requirements of different information domains. While in some 
cases, interoperability with libraries is a specific requirement, this is 
not always the case. Further, in more cases than not, the native, shared 
data models are already expressed in terms of XML. I suggest that in many 
cases where people consider Z39.50 for their application, it is an 
inhibiting factor that people feel forced to munge their existing data 
model into one that is suitable for Z39.50, either by mapping elements 
(more or less) to Bib-1 attributes, or by declaring new, flat sets of 
numerical USE attributes. In either case, the process introduces 
needless complexity in the documentation and maintenance of the 
domain-specific profile.

The attribute set architecture already provides a mechanism for expressing 
searches for hierarchically nested elements which can be seen as a subset 
of XPATH. However, its primary drawbacks are that the mechanism is 
comparatively unfamiliar to people versed in XML techniques, and it is less 
expressive than XPATH (for instance, constraints on attribute values, as 
above, are probably not uncommon, yet have no clear parallel in the AA or 
XD-1).

Of course there's nothing stopping a profile author from defining abstract 
access points which don't correspond directly to individual fields (such as 
an "ANY" attribute, or database record metadata). Similarly, there's 
nothing stopping a profile from defining a crosswalk to existing, 
conventional attribute sets and requiring support for these.

We have already embraced XML as a bona fide record syntax (e.g. in the Bath 
profile). I think the next logical step is to also allow searches to be 
expressed in terms comfortable to the wider community -- without 
sacrificing the power of the Type-1 query.

WHY BOTHER?

1) Because, by softening some of the library-traditional ways of using 
Z39.50 and showing how it can be used easily without bending your existing 
data model out of shape, we can help break down the misconception that it's 
not XML-friendly. Will it "sell" better? It still won't win the world, but 
you can't miss the fact that the W3C *still* doesn't have a suitable IR 
protocol. Maybe the slot is still open.

2) Because it's a natural progression from the move to support XML as a 
retrieval syntax and even from the thinking behind SRW.

3) Because it is happening anyway. I believe there are several ZIG members 
who make search engines which are natively oriented around XML-like data 
models, and who market their products to a broader range of customers. 
Speaking for ourselves, we *need* to do this, or else abandon Z39.50 
completely, and I suspect others may be in the same position. Why not do it 
in an interoperable way?

FUTURE STEPS?

It'd be useful to consider extending the Type-1 query to allow the use of 
a URI instead of an OID to denote the "attribute set". You will most 
likely want to mix and match the two types of identifiers -- there's no 
clear reason to abandon the Utility attribute set for the non-USE attributes.
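
Mixing the two kinds of identifier would mean that software has to tell them apart. One way this might look is sketched below in Python: an attribute set identifier is classified as either a dotted OID or a URI. The type AttributeSetId, the function parse_attribute_set_id() and the example.org URI are all hypothetical; the OID shown is the one registered for Bib-1.

```python
import re
from dataclasses import dataclass
from urllib.parse import urlparse

# A dotted sequence of integers, e.g. 1.2.840.10003.3.1 (Bib-1).
OID_PATTERN = re.compile(r"^\d+(\.\d+)+$")


@dataclass(frozen=True)
class AttributeSetId:
    kind: str   # "oid" or "uri"
    value: str


def parse_attribute_set_id(text):
    """Classify an attribute set identifier as an OID or a URI."""
    if OID_PATTERN.match(text):
        return AttributeSetId("oid", text)
    parsed = urlparse(text)
    # Require a scheme plus something after it, so bare words are rejected.
    if parsed.scheme and (parsed.netloc or parsed.path):
        return AttributeSetId("uri", text)
    raise ValueError("not an OID or URI: %r" % text)


if __name__ == "__main__":
    # An existing OID-identified set alongside a hypothetical URI-identified one.
    print(parse_attribute_set_id("1.2.840.10003.3.1"))
    print(parse_attribute_set_id("http://example.org/schemas/book"))
```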

Cheers,

--Sebastian
--
Sebastian Hammer, Index Data <http://www.indexdata.dk/>
Ph: +45 3341 0100, Fax: +45 3341 0101
Received on Saturday, 20 April 2002 17:44:58 UTC