Re: Proposal: Searching XML from Sebastian Hammer on 2002-04-22 (www-zig@w3.org from April 2002)

From: Sebastian Hammer <quinn@indexdata.dk>
Date: Mon, 22 Apr 2002 08:55:30 +0200
To: Alan Kent <ajk@mds.rmit.edu.au>, www-zig@w3.org
Message-Id: <4.2.0.58.20020422083908.01a457d8@bagel.indexdata.dk>

At 11:42 22-04-2002 +1000, Alan Kent wrote:

>We actually allow more than one lump of XML in a single record

Tricky... but perhaps irrelevant to the proposal. You're mixing up search
and retrieval here. The proposal assumes that the server knows how to
model its content using an XML-friendly data model. Mixing multiple
XML records into one document could be modelled using an abstract schema
that places each document in a different subtree of a virtual superrecord,
assuming you want interoperability around your model.

>But I agree with the problems raised by others. XPATH has a quite
>different model to querying than Z39.50. You can do joins. You can
>specify the top node for a query which is different from the root
>node of the parse tree (so one document may be split into lots
>of little fragments). One work around was to say 'XPATH is only
>used to determine if the record matches'. This kept it within
>the Z39.50 model.

Yes. The proposal goes further than this, and only uses XPATH to identify 
sets of elements to match against query terms in the usual way. The actual 
match would be governed by the usual control attributes.
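
To illustrate the division of labour described above, here is a minimal
sketch in Python. It is not part of the proposal: the sample record, the
function name and the matching rule (a case-insensitive whole-word match
standing in for whatever the control attributes would specify) are all
assumptions made for the example. Note also that Python's ElementTree
implements only a small subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample record, invented for this sketch.
RECORD = """<record>
  <title>Searching XML with Z39.50</title>
  <author>Hammer</author>
  <author>Kent</author>
</record>"""


def record_matches(xml_text, path, term):
    """Return True if any element selected by `path` contains `term` as a word.

    The path expression only selects candidate elements; the comparison
    itself is an ordinary term match (here: case-insensitive, whole word).
    """
    root = ET.fromstring(xml_text)
    for elem in root.findall(path):  # paths are relative to the root element
        words = (elem.text or "").lower().split()
        if term.lower() in words:
            return True
    return False
```

The path never decides what is returned; it only narrows down where the
term is looked for, so the record as a whole still either matches or not.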

>But the real problem we had was a conceptual one. As others have
>said also, Z39.50 abstracts the query model from the physical
>representation. To me this is fundamental to the protocol.
>It's the differentiating factor. As soon as you tie queries to
>the physical data format, it's a big step away from this abstraction
>model.

Yes. Please note again that the proposal does not assume such a tie. It
*allows* it, in keeping with the softening of the search/retrieve
split that is inherent in the architecture and in subsequent work, e.g. on
the MARC attribute set.

>But there is a way to maintain the current model. This is for there
>to be an abstract XML structure that is defined as a part of the
>abstract record structure. This XML structure does not have to be
>the same as the physical representation of the data. (In practical
>terms, you can think of an XSLT stylesheet converting the underlying
>physical representation to the publicly agreed-to logical
>representation for querying on.) Queries are expressed on the logical
>XML structure. If it matches, the underlying record is returned.
>This keeps the current Z39.50 separation intact.

Precisely. It may be as simple as a profile defining (or referring to!) two
abstract schemas -- one for searching, and one for retrieval. This is
exactly what we do today, except that a set of USE attributes is customarily
a flat list of numbers, which enforces an abstraction that some find
unpleasant. I suspect ambitious profiles defined with
interoperability in mind will go the dual-schema route or stick to
numerical attributes. But we're opening up Z39.50 to a whole truckload of
simpler and/or ad hoc applications -- without necessarily changing a bit of
the protocol spec.
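
The dual-schema idea can be sketched roughly as follows. This is only an
illustration under stated assumptions: the physical record layout, the
tag-to-element mapping and the function names are hypothetical, and a
plain Python function stands in for the XSLT stylesheet Alan mentions.
Queries run against the logical search view; a hit returns the
underlying physical record untouched.

```python
import xml.etree.ElementTree as ET

# Hypothetical physical record: MARC-like tagged fields (245 = title, 100 = author).
PHYSICAL = "<rec><f tag='245'>Searching XML</f><f tag='100'>Hammer</f></rec>"

# Hypothetical mapping from physical tags to element names of the search schema.
TAG_TO_LOGICAL = {"245": "title", "100": "author"}


def to_search_view(physical_xml):
    """Build the logical search record from the physical one."""
    src = ET.fromstring(physical_xml)
    view = ET.Element("record")
    for field in src.findall("f"):
        name = TAG_TO_LOGICAL.get(field.get("tag"))
        if name is not None:  # fields outside the search schema are not searchable
            ET.SubElement(view, name).text = field.text
    return view


def search(records, path, term):
    """Match `term` against the search view; return the physical records that hit."""
    hits = []
    for physical in records:
        view = to_search_view(physical)
        if any(term.lower() in (e.text or "").lower() for e in view.findall(path)):
            hits.append(physical)
    return hits
```

The client only ever sees the search schema in its queries, so the
physical layout can change without breaking anyone.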

>But we have not got around to implementing it yet. I have not looked
>for a while, but there were no standard text searching operators
>(proximity, stemming, etc) in XPATH. Some people have gone off
>and defined their own. Z39.50 has lots of features here. But there
>does not seem to be any good way to fit the Z39.50 text query
>features into XPATH at the lowest level.

I actually think the Type-1 query offers an excellent conceptual framework 
for doing this (a string-valued attribute) -- we just have to use it.
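
As a rough sketch of what that could look like, the Python below models a
Type-1 style query tree in which an attribute value may be either the
usual number or a string holding a path expression. The class names, the
example path /record/title and the prefix-style rendering are assumptions
made for illustration, not a wire format. The numeric values follow
Bib-1 (type 1 = Use, type 5 = Truncation, Use 1003 = Author).

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Attribute:
    attr_type: int          # e.g. 1 = Use, 5 = Truncation in Bib-1
    value: Union[int, str]  # numeric as today, or a string such as a path


@dataclass(frozen=True)
class Term:
    attributes: tuple       # tuple of Attribute
    term: str


@dataclass(frozen=True)
class Operator:
    op: str                 # "and", "or" or "not"
    left: object            # Term or Operator
    right: object           # Term or Operator


def render(node):
    """Render the tree in a prefix notation, for display only."""
    if isinstance(node, Operator):
        return f"@{node.op} {render(node.left)} {render(node.right)}"
    attrs = " ".join(f"@attr {a.attr_type}={a.value}" for a in node.attributes)
    return f'{attrs} "{node.term}"'


# A string-valued Use attribute sits next to ordinary numeric attributes.
query = Operator(
    "and",
    Term((Attribute(1, "/record/title"), Attribute(5, 1)), "xml"),
    Term((Attribute(1, 1003),), "hammer"),
)
```

Truncation, relation and the other control attributes keep working as
before; only the access point changes from a number to a string.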

>Another icky thing about XPATH in a way is that there is so much
>you can do with XPATH, it's hard to enumerate all the queries possible.
>By this I mean it has all sorts of join capabilities. It's easy to
>write a query that an index cannot support. Do you then say such
>queries are not allowed? Or require that a server do a brute force
>false match check? Or define a subset of XPATH for use with Z39.50?
>Seems a lot of potential for interoperability woes with people
>implementing different subsets (whatever is easy for them).

This is indeed problematic, although you can argue it's not much different 
from the present state of things with Bib-1 (hands up all who support 
regular expressions, stemming and phonetic searches -- just kidding). I 
think the answer lies in profiles and/or attribute sets listing precise 
XPATH expressions that are relevant to support a given application, and 
which must be supported by all compliant servers. This approach also leaves 
servers free to decide whether to actually implement XPATH in their engines 
(hard, in some cases), or whether simply to match the XPATH expressions as 
if they were simple identifiers. In the latter case, the additional work 
required to bring existing servers up to speed would be very moderate...
but remember again, my proposal was not primarily directed at existing 
bibliographic servers but rather at opening up the board for easier 
adoption by other application areas.
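
The "simple identifiers" option could look something like the sketch
below. Everything in it is hypothetical: the two path expressions stand in
for whatever a profile would list, and the index names and exception are
invented for the example. The point is only that the server does an exact
string lookup and never evaluates the expression.

```python
# Hypothetical profile: the exact expressions a compliant server must accept,
# each mapped to one of the server's existing indexes.
PROFILE_INDEXES = {
    "/record/title": "title_idx",
    "/record/author": "author_idx",
}


class UnsupportedAccessPoint(Exception):
    """Raised when the expression is not one the profile lists."""


def resolve_access_point(xpath):
    """Treat the expression as an opaque identifier and look up its index."""
    key = xpath.strip()
    try:
        return PROFILE_INDEXES[key]
    except KeyError:
        # A real server would presumably return a diagnostic to the client here.
        raise UnsupportedAccessPoint(key) from None
```

An existing server only needs the lookup table; a server with a real XPATH
engine can accept the same expressions and more.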

--Sebastian
--
Sebastian Hammer, Index Data <http://www.indexdata.dk/>
Ph: +45 3341 0100, Fax: +45 3341 0101
Received on Monday, 22 April 2002 02:54:47 UTC