RE: Towards DASL fulltextsearch query from Babich, Alan on 1998-07-24 (www-webdav-dasl@w3.org from July to September 1998)

From: Babich, Alan <ABabich@filenet.com>
Date: Fri, 24 Jul 1998 13:11:47 -0700
To: "'Jim Davis'" <jdavis@parc.xerox.com>, www-webdav-dasl@w3.org
Message-ID: <72B1992276A9D111A20E00805FEAC96D01324C9A@cm-expo1.filenet.com>
I assume that "contains" is going to be explicitly
called out as a "pass through" operator for DASL 1.0.
Then, "contains" would support any of the engines 
you mentioned to the extent that a string can
be mapped into their API. My guess is that for
those systems that can take a document as
a query, this is a particularly straightforward
mapping. Some engines can directly accept
a string with embedded syntax that can specify 
stemming, proximity, case sensitivity, soundex, and 
Boolean combinations of conditions. Of course,
"contains" wouldn't be interoperable, because
we haven't provided a way to advertise what
the string syntax and semantics are.
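
A minimal sketch of why a pass-through "contains" is not interoperable,
assuming two toy engines that are purely hypothetical (neither models
Verity, WAIS, SMART, or MG). The same query string goes to each engine
unchanged, and each applies its own syntax and semantics:

```python
def boolean_engine(docs, query):
    """Toy engine (hypothetical): treats ' AND ' as a Boolean operator."""
    terms = [t.strip().lower() for t in query.split(" AND ")]
    return [d for d in docs if all(t in d.lower().split() for t in terms)]


def bag_of_words_engine(docs, query):
    """Toy engine (hypothetical): every token is a search word; any match counts."""
    terms = query.lower().split()
    return [d for d in docs if any(t in d.lower().split() for t in terms)]


def contains(engine, docs, query):
    """Pass-through: the query string is handed to the engine unchanged."""
    return engine(docs, query)


docs = ["cat and dog", "cat only", "dog only"]
strict = contains(boolean_engine, docs, "cat AND dog")
loose = contains(bag_of_words_engine, docs, "cat AND dog")
```

Here the two engines return different result sets for the identical
string, which is the interoperability gap: the client has no advertised
way to know which interpretation it will get.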

Having had the advantage of working on
this problem in other contexts, I think
the "second grammar" would amount to
simplesearch version 2. That would be
the same as simplesearch version 1 with
additional optional content search
operators, some score type properties 
(STARTS has normalized and raw score properties, 
for example), and some constraints advertised
in the QSD on the overall form of the where element.
We could also optionally define one or more
hit highlighting property formats, but that's
a bigger rathole than the rest of it combined.

The existing AND, OR, and NOT operators
could be used to advantage for Boolean
combinations by some collections, and some degree 
of AND and OR can be embedded in the operator's
operands as well. (For example, STARTS has an
operator with a list of words as one
operand, and another operand that says
whether all or some of the words must
be in the document. This is essentially
equivalent to AND or OR.)
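
A minimal sketch of that equivalence, assuming a hypothetical
word_list_match operator with an "all"/"some" operand. The function
names and the tokenizer are illustrative only and are not STARTS or
DASL syntax:

```python
import re


def tokenize(text):
    """Lowercase a document and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def contains_word(text, word):
    """Single-word content test, the building block for AND/OR."""
    return word.lower() in tokenize(text)


def word_list_match(text, words, mode):
    """Hypothetical list operator: mode is "all" or "some"."""
    hits = [contains_word(text, w) for w in words]
    if mode == "all":
        return all(hits)   # same as AND over the single-word tests
    if mode == "some":
        return any(hits)   # same as OR over the single-word tests
    raise ValueError("mode must be 'all' or 'some'")


doc = "The quick brown fox"
all_result = word_list_match(doc, ["quick", "wolf"], "all")
some_result = word_list_match(doc, ["quick", "wolf"], "some")
```

With "all" the result matches contains_word(doc, "quick") and
contains_word(doc, "wolf"); with "some" it matches the same two tests
joined by or.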

We could define any number of generalized
and/or engine specific full text search
operators, and accommodate any number of engines. 
The QSD would, of course, advertise what 
operators are available for a particular 
collection. The only significant downside to 
inventing optional operators is the time it takes.
Of course, the only operators really useful
for interoperability would be the generalized
ones, and these would capture only least common
denominator functionality. We should start
there. In fact, we should probably start with STARTS.
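
A minimal sketch of the least common denominator point, assuming the
QSD can be modeled as a set of operator names per collection. The
collection names, the operator names beyond and/or/not/contains, and
the common_operators helper are all hypothetical:

```python
def common_operators(advertised):
    """Return the operators advertised by every collection."""
    sets = [set(ops) for ops in advertised.values()]
    return set.intersection(*sets) if sets else set()


# Hypothetical QSD contents for three collections.
advertised = {
    "collection_a": {"and", "or", "not", "contains", "proximity"},
    "collection_b": {"and", "or", "not", "contains", "stem"},
    "collection_c": {"and", "or", "not", "contains"},
}

portable = common_operators(advertised)
```

A client that wants one query to work everywhere is limited to the
operators in portable; anything else has to be chosen per collection
after reading that collection's QSD.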
I'm not saying when we should start on this.

Alan Babich

> -----Original Message-----
> From: Jim Davis [mailto:jdavis@parc.xerox.com]
> Sent: July 24, 1998 11:14 AM
> To: www-webdav-dasl@w3.org
> Subject: Towards DASL fulltextsearch query
> 
> 
> The DASL simplesearch grammar, while supporting SQL fairly well, does a
> poor job of supporting full text search engines such as Verity, WAIS,
> SMART, or MG.
> 
> For some such engines, a query is a document, or at least a lengthy
> portion of text, rather than a set of expressions on fields joined by
> Boolean operators. For others the query is a small set of words, and the
> query may specify the maximum allowable distance between words in the
> target documents (e.g. within N words, in the same sentence, or in the
> same paragraph). The result is a set of documents ordered according to
> similarity to the query. Typically there is a cutoff in the number of
> documents returned, but in principle the similarity is computed for
> every document in the corpus. Usually a numeric score is returned for
> each document.
> 
> Many of these systems also allow the client to specify choice of token
> processing (e.g. stemming), the matching rules (soundex, left or right
> truncation), and/or to influence the ranking by providing weights on
> terms used in the search.
> 
> None of these are well supported in the DASL simplesearch grammar, and I
> don't think they should be.
> 
> For one thing, there is no common practise to standardize on for queries
> that work on both boolean and full text engines.  (STARTS is the best
> attempt so far.)
> 
> Even if we succeeded in defining it, the result would not be a *simple*
> search grammar, and I think the likely outcome would be that typical
> implementations of DASL simplesearch would either support the boolean
> side well, or the fulltext side well, but not both.  So in practice, a
> client would do query schema discovery to find out which kind worked,
> and once it does that, there's no real difference between doing QSD on
> one grammar, and grammar discovery (via OPTIONS) on the arbiter itself.
> In other words, rather than make a complicated simplesearch that can
> express both kinds of search, leave simplesearch alone, and define a
> fulltextsearch.
> 
> This is not to say that there should be NO content search at all in
> DASL; to the contrary, there should, but it should be quite limited.
> 
> It's really a call to begin thinking about defining a second grammar,
> which may or may not make it into the first DASL specification.
>
Received on Friday, 24 July 1998 16:14:47 UTC