Re: Proposal: Searching XML from Alan Kent on 2002-04-22 (www-zig@w3.org from April 2002)

From: Alan Kent <ajk@mds.rmit.edu.au>
Date: Mon, 22 Apr 2002 11:42:54 +1000
To: Liam Quin <liam@w3.org>, www-zig@w3.org
Message-ID: <20020422114253.H23301@io.mds.rmit.edu.au>

On Sat, Apr 20, 2002 at 11:27:15PM +0100, Robert Sanderson wrote:
> Yep.  We'd implement it for sure.  On the other hand, the practical 
> advantages of it aren't as high as you might expect. Unless you know what 
> the data is like, you can't really send a sensible XPATH search. 
> If without prior knowledge of the database, you can't send a useful XPATH 
> search, then you might as well just configure enumerated access points.
> (Which is what we do now, mapped to XPATH (almost) in the configfile)

We have thought about how to include XPATH queries a number of times.
Here are our thoughts on it (which overlap with what other people have
posted).

We actually allow more than one lump of XML in a single record
(You can have multiple GRS-1 fields in one record holding XML).
So one approach proposed internally was to allow *within* an
attribute, an XPATH expression to be specified. That is, have
a special query term format (EXTERNAL is in there after all as
a query term) that searched an XPATH expression within a single
attribute. That way separate attributes can be bound to the separate
XML fields.

But I agree with the problems raised by others. XPATH has a quite
different model to querying than Z39.50. You can do joins. You can
specify the top node for a query which is different from the root
node of the parse tree (so one document may be split into lots
of little fragments). One workaround was to say 'XPATH is only
used to determine if the record matches'. This kept it within
the Z39.50 model.
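
As a rough illustration of the "match only" idea, here is a minimal
Python sketch. It assumes a record is a mapping from attribute name to
one lump of XML, and it uses the limited XPath subset supported by
Python's xml.etree.ElementTree. The record layout, the field names and
the record_matches function are all hypothetical, not taken from any
existing server.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: attribute name -> XML text held in that field.
record = {
    "metadata": "<meta><title>Searching XML</title><year>2002</year></meta>",
    "body": "<doc><sec><p>XPath and Z39.50</p></sec></doc>",
}

def record_matches(record, attribute, path):
    """Return True if `path` selects at least one node in the XML field
    bound to `attribute`.

    Only a yes/no answer leaves this function. The selected fragments
    are deliberately discarded, so the unit of retrieval stays the
    whole record.
    """
    xml_text = record.get(attribute)
    if xml_text is None:
        return False
    root = ET.fromstring(xml_text)
    return root.find(path) is not None

# The same path is evaluated separately against each bound attribute.
in_metadata = record_matches(record, "metadata", "./year")
in_body = record_matches(record, "body", "./year")
```

Because the path only ever yields true or false per record, it cannot
split a document into fragments or join across records.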

But the real problem we had was a conceptual one. As others have
said also, Z39.50 abstracts the query model from the physical
representation. To me this is fundamental to the protocol.
It's the differentiating factor. As soon as you tie queries to
the physical data format, it's a big step away from this abstraction
model.

But there is a way to maintain the current model. This is for there
to be an abstract XML structure that is defined as a part of the
abstract record structure. This XML structure does not have to be
the same as the physical representation of the data. (In practical
terms, you can think of an XSLT stylesheet converting the underlying
physical representation to the publicly agreed logical
representation for querying on.) Queries are expressed on the logical
XML structure. If it matches, the underlying record is returned.
This keeps the current Z39.50 separation intact.
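
A minimal Python sketch of that separation follows. It assumes a plain
function standing in for the XSLT stylesheet (the Python standard
library has no XSLT processor), and the physical format, the tag
mapping and the logical element names are all invented for
illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical physical storage format for one record.
PHYSICAL = "<rec><f tag='245'>Searching XML</f><f tag='100'>Kent, Alan</f></rec>"

# Hypothetical mapping from physical tags to the agreed logical names.
TAG_TO_LOGICAL = {"245": "title", "100": "author"}

def to_logical(physical_xml):
    """Stand-in for the XSLT step: build the logical tree from the
    physical one. Fields with no logical name are simply not exposed."""
    physical = ET.fromstring(physical_xml)
    logical = ET.Element("record")
    for field in physical.findall("f"):
        name = TAG_TO_LOGICAL.get(field.get("tag"))
        if name is not None:
            ET.SubElement(logical, name).text = field.text
    return logical

def search(records, path):
    """Evaluate `path` against each record's logical view and return
    the matching records in their untouched physical form."""
    return [r for r in records if to_logical(r).find(path) is not None]

hits = search([PHYSICAL], "./author")
```

A path written against the physical layout, such as "./f", matches
nothing here, since clients only ever see the logical structure.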

But we have not got around to implementing it yet. I have not looked
for a while, but there were no standard text searching operators
(proximity, stemming, etc) in XPATH. Some people have gone off
and defined their own. Z39.50 has lots of features here. But there
does not seem to be any good way to fit the Z39.50 text query
features into XPATH at the lowest level.

Another icky thing about XPATH in a way is that there is so much
you can do with XPATH, it's hard to enumerate all the queries possible.
By this I mean it has all sorts of join capabilities. It's easy to
write a query that an index cannot support. Do you then say such
queries are not allowed? Or require that a server do a brute force
false match check? Or define a subset of XPATH for use with Z39.50?
Seems a lot of potential for interoperability woes with people
implementing different subsets (whatever is easy for them).
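
To make the first two options concrete, here is a small Python sketch
of a server that answers indexed paths from an index and then either
rejects everything else or falls back to scanning every record. It is
a simplification: the index is keyed on the literal path string, the
brute-force step is a full scan rather than a check on index
candidates, and all names (build_index, search, UnsupportedQuery) are
hypothetical.

```python
import xml.etree.ElementTree as ET

class UnsupportedQuery(Exception):
    """Raised when a path is outside the subset this server accepts."""

def matches(xml_text, path):
    """True if `path` selects at least one node in the record."""
    return ET.fromstring(xml_text).find(path) is not None

def build_index(records, indexed_paths):
    """Map each indexed path to the positions of the records it matches."""
    return {
        path: {i for i, xml_text in enumerate(records) if matches(xml_text, path)}
        for path in indexed_paths
    }

def search(records, index, path, brute_force=False):
    if path in index:
        # Supported query: answered from the index alone.
        return sorted(index[path])
    if not brute_force:
        # Option 1: such queries are not allowed.
        raise UnsupportedQuery(path)
    # Option 2: brute force, checking every record.
    return [i for i, xml_text in enumerate(records) if matches(xml_text, path)]

records = ["<r><title>a</title></r>", "<r><note>b</note></r>"]
index = build_index(records, ["./title"])
indexed_hits = search(records, index, "./title")
scanned_hits = search(records, index, "./note", brute_force=True)
```

Two servers with different index sets or different fallback policies
would accept different queries, which is the interoperability worry.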

Alan
Received on Sunday, 21 April 2002 21:43:46 UTC