Towards DASL fulltextsearch query

The DASL simplesearch grammar, while supporting SQL fairly well, does a
poor job of supporting full text search engines such as Verity, WAIS,
SMART, or MG. 

For some such engines, a query is a document, or at least a lengthy portion
of text, rather than a set of expressions on fields joined by Boolean operators.
For others the query is a small set of words, and the query may specify the
maximum allowable distance between words in the target documents (e.g.
within N words, in the same sentence, or in the same paragraph).   The
result is a set of documents ordered according to similarity to the query.
Typically there is a cutoff in the number of documents returned, but in
principle the similarity is computed for every document in the corpus.
Usually a numeric score is returned for each document.
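
To make the ranked-retrieval model concrete, here is a minimal sketch in
Python. It is not taken from any of the engines named above: the cosine
similarity over raw term counts, the whitespace tokenizer, and the names
ranked_search and within_n_words are all assumptions chosen for illustration.
It scores every document in the corpus, returns a numeric score per document,
and applies a cutoff only at the end.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def cosine_similarity(query_counts, doc_counts):
    # Cosine of the angle between two term-count vectors.
    shared = set(query_counts) & set(doc_counts)
    dot = sum(query_counts[t] * doc_counts[t] for t in shared)
    q_norm = math.sqrt(sum(c * c for c in query_counts.values()))
    d_norm = math.sqrt(sum(c * c for c in doc_counts.values()))
    if q_norm == 0 or d_norm == 0:
        return 0.0
    return dot / (q_norm * d_norm)


def ranked_search(query_text, corpus, cutoff=10):
    # corpus maps a document name to its text. Every document is scored;
    # the cutoff only limits how many results are returned.
    query_counts = Counter(tokenize(query_text))
    scored = [
        (cosine_similarity(query_counts, Counter(tokenize(text))), name)
        for name, text in corpus.items()
    ]
    # Highest score first; ties broken by name for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, score) for score, name in scored[:cutoff]]


def within_n_words(tokens, first, second, n):
    # True if some occurrence of `first` lies within n words of `second`.
    first_positions = [i for i, t in enumerate(tokens) if t == first]
    second_positions = [i for i, t in enumerate(tokens) if t == second]
    return any(abs(i - j) <= n
               for i in first_positions for j in second_positions)
```

Note that the query passed to ranked_search can be a few words or a whole
document; the scoring does not distinguish between the two.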

Many of these systems also allow the client to specify choice of token
processing (e.g. stemming), the matching rules (soundex, left or right
truncation), and/or to influence the ranking by providing weights on terms
used in the search.
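
As a rough illustration of two of these controls, the sketch below combines
client-supplied term weights with left or right truncation. It is a toy under
stated assumptions, not the behaviour of any particular engine: the function
names, the "none"/"left"/"right" option values, and the simple
weight-times-hit-count score are all invented for this example, and stemming
and soundex are left out.

```python
def matches(term, token, truncation="none"):
    # Right truncation: the term's right end is open ("comput*"),
    # so the term must be a prefix of the token.
    if truncation == "right":
        return token.startswith(term)
    # Left truncation: the term's left end is open ("*ing"),
    # so the term must be a suffix of the token.
    if truncation == "left":
        return token.endswith(term)
    # No truncation: exact match only.
    return token == term


def weighted_score(weighted_terms, text, truncation="none"):
    # weighted_terms maps each query term to a client-chosen weight.
    tokens = text.lower().split()
    score = 0.0
    for term, weight in weighted_terms.items():
        hits = sum(1 for token in tokens if matches(term, token, truncation))
        score += weight * hits
    return score
```

Even in this toy form, the query carries per-term weights and a matching
mode, neither of which fits naturally into a set of Boolean expressions on
fields.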

None of these are well supported in the DASL simplesearch grammar, and I
don't think they should be.

For one thing, there is no common practice to standardize on for queries
that work on both boolean and full text engines.  (STARTS is the best
attempt so far.)

Even if we succeeded in defining it, the result would not be a *simple*
search grammar, and I think the likely outcome would be that typical
implementations of DASL simplesearch would either support the boolean side
well, or the fulltext side well, but not both.  So in practice, a client
would do query schema discovery (QSD) to find out which kind worked, and once
it has done that, there's no real difference between doing QSD on one grammar
and doing grammar discovery (via OPTIONS) on the arbiter itself.  In other words,
rather than make a complicated simplesearch that can express both kinds of
search, leave simplesearch alone, and define a fulltextsearch.
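
On the client side, grammar discovery then reduces to a simple choice. The
sketch below assumes the client has already obtained the list of grammar
names the arbiter advertises (for example from its OPTIONS response); how
that list is encoded on the wire is not specified here, and the function name
and the plain-string grammar names are hypothetical.

```python
def choose_grammar(advertised, preferences):
    # advertised: grammar names the arbiter says it supports (assumed to
    #             have been extracted from its OPTIONS response already).
    # preferences: grammar names the client can generate, best first.
    advertised_set = set(advertised)
    for grammar in preferences:
        if grammar in advertised_set:
            return grammar
    # No grammar in common: the client cannot query this arbiter.
    return None


# A client that prefers full text search but can fall back to simplesearch.
grammar = choose_grammar(
    ["simplesearch", "fulltextsearch"],
    ["fulltextsearch", "simplesearch"],
)
```

The client does one lookup and then builds its query in whichever grammar was
chosen, rather than probing a single combined grammar to learn which half of
it actually works.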

This is not to say that there should be NO content search at all in DASL.
On the contrary, there should be, but it should be quite limited.

This note is really a call to begin thinking about defining a second grammar,
which may or may not make it into the first DASL specification.

Received on Friday, 24 July 1998 14:14:27 UTC