RE: Towards DASL fulltextsearch query

From: Dale Lowry <dlowry.ORM2-1.OREM2@gw.novell.com>
Date: Mon, 27 Jul 1998 10:37:53 -0600
Message-Id: <s5bc5891.067@GW.NOVELL.COM>
To: <www-webdav-dasl@w3.org>
I like the idea of the second grammar including the full simple search capabilities in addition to more advanced operators such as soundex, left or right truncation, and proximity. For some search engines, this full set of operators is valid in a single query against all properties of the object, including the document content. We need to be careful not to treat document properties as a separate animal from the document content.

Another approach would be to have a simplewhere and a fulltextwhere which can contain a nested simplewhere or fulltextwhere. In this scenario the sets of operators for simplewhere and fulltextwhere could be disjoint. Also, the where (simple or full text), from, and select elements would be siblings under searchrequest rather than under a simplesearch element.
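
The nested layout described above can be sketched in Python. This is only an illustration under stated assumptions: the element names searchrequest, select, from, simplewhere, and fulltextwhere come from this thread, while the "near" and "eq" operators, their attributes, and the property and scope values are hypothetical and are not taken from any DASL draft.

```python
import xml.etree.ElementTree as ET


def build_search_request():
    # select, from, and the where element are siblings under searchrequest.
    req = ET.Element("searchrequest")
    ET.SubElement(req, "select").text = "title"   # hypothetical property name
    ET.SubElement(req, "from").text = "/docs/"    # hypothetical scope

    # Outer full-text condition; "near" is a hypothetical proximity operator.
    ftw = ET.SubElement(req, "fulltextwhere")
    near = ET.SubElement(ftw, "near", distance="5")
    near.text = "search grammar"

    # Nested simplewhere with its own (disjoint) operator set;
    # "eq" is a hypothetical comparison operator.
    sw = ET.SubElement(ftw, "simplewhere")
    eq = ET.SubElement(sw, "eq", prop="author")
    eq.text = "Lowry"
    return req


request = build_search_request()
print(ET.tostring(request, encoding="unicode"))
```

Because select and from sit beside the where element, the same request shape works whether the outer condition is simple or full text.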

Even if we don't allow nesting of heterogeneous where elements, do the from and select clauses really belong under the simplesearch element? It seems that the from and select elements exist independent of whether the query is simple or full text.

Dale A. Lowry ( dlowry@novell.com )
Novell GroupWise Document Management

>>> "Babich, Alan" <ABabich@filenet.com> 07/24 2:14 PM >>>
I assume that "contains" is going to be explicitly
called out as a "pass through" operator for DASL 1.0.
Then, "contains" would support any of the engines 
you mentioned to the extent that a string can
be mapped into their API. My guess is that for
those systems that can take a document as
a query, this is a particularly straightforward
mapping. Some engines can directly accept
a string with embedded syntax that can specify 
stemming, proximity, case sensitivity, soundex, and 
Boolean combinations of conditions. Of course,
"contains" wouldn't be interoperable, because
we haven't provided a way to advertise what
the string syntax and semantics are.
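
A minimal Python sketch of the pass-through idea, assuming two toy engines whose query-string semantics are invented here for illustration (neither corresponds to a real product API). It shows why the same "contains" string is not interoperable: the server hands it over unparsed, and each engine interprets it differently.

```python
def all_words_engine(query, docs):
    # Hypothetical engine: every word in the string must appear somewhere.
    words = query.lower().split()
    return [name for name, text in docs.items()
            if all(w in text.lower().split() for w in words)]


def phrase_engine(query, docs):
    # Hypothetical engine: the string must appear as an exact phrase.
    return [name for name, text in docs.items()
            if query.lower() in text.lower()]


def contains(engine, query, docs):
    # Pass-through: the string is handed to the engine without being
    # parsed or validated.
    return engine(query, docs)


docs = {
    "a": "full text search grammar",
    "b": "grammar for text search",
}
by_words = contains(all_words_engine, "search grammar", docs)
by_phrase = contains(phrase_engine, "search grammar", docs)
print(by_words, by_phrase)
```

The two engines return different result sets for the identical string, which is the interoperability gap noted above.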

Having had the advantage of working on
this problem in other contexts, I think
that what the "second grammar" would consist
of would be simplesearch version 2. That would be
the same as simplesearch version 1 with
additional optional content search
operators, some score type properties 
(STARTS has normalized and raw score properties, 
for example), and some constraints advertised
in the QSD on the overall form of the where element.
We could also optionally define one or more
hit highlighting property formats, but that's
a bigger rathole than the rest of it combined.

The existing AND, OR, and NOT operators
could be used to advantage for Boolean
combinations by some collections, and some degree 
of AND and OR can be embedded in the operator's
operands as well. (For example, STARTS has an
operator with a list of words as one
operand, and another operand that says
whether all or some of the words must
be in the document. This is essentially
equivalent to AND or OR.)
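
To illustrate the equivalence, here is a small Python sketch of an operator with a word-list operand and an all-or-some operand. The function name, the flag, and the whitespace tokenization are assumptions made for this example; this is not the actual STARTS syntax.

```python
def word_list_match(words, require_all, text):
    # Tokenize on whitespace; real engines would do far more than this.
    tokens = set(text.lower().split())
    hits = [w.lower() in tokens for w in words]
    # require_all=True behaves like AND over the words, False like OR.
    return all(hits) if require_all else any(hits)


doc = "full text search in dasl"
both_present = word_list_match(["text", "search"], True, doc)
all_with_missing = word_list_match(["text", "sql"], True, doc)
some_with_missing = word_list_match(["text", "sql"], False, doc)
print(both_present, all_with_missing, some_with_missing)
```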

We could define any number of generalized
and/or engine specific full text search
operators, and accommodate any number of engines. 
The QSD would, of course, advertise what 
operators are available for a particular 
collection. The only significant downside to 
inventing optional operators is the time it takes.
Of course, the only operators really useful
for interoperability would be the generalized
ones, and these would capture only least common
denominator functionality. We should start
there. In fact, we should probably start with STARTS.
I'm not saying when we should start on this.
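
A rough Python sketch of how a client might use advertised operators, assuming the QSD result has already been reduced to a set of operator names per collection. The collection paths, the operator names, and the dictionary shape are all hypothetical; the thread does not define a QSD format.

```python
# Hypothetical per-collection operator sets, as a client might cache them
# after query schema discovery.
ADVERTISED = {
    "/docs/": {"and", "or", "not", "contains", "near"},
    "/mail/": {"and", "or", "not", "contains"},
}


def unsupported_operators(collection, needed):
    # Return the operators the query needs that the collection did not
    # advertise, sorted for stable output.
    available = ADVERTISED.get(collection, set())
    return sorted(set(needed) - available)


missing_mail = unsupported_operators("/mail/", ["contains", "near", "soundex"])
missing_docs = unsupported_operators("/docs/", ["contains", "near"])
print(missing_mail, missing_docs)
```

A client could use the result to drop optional operators or to fall back to the generalized ones before sending the query.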

Alan Babich

> -----Original Message-----
> From: Jim Davis [mailto:jdavis@parc.xerox.com] 
> Sent: July 24, 1998 11:14 AM
> To: www-webdav-dasl@w3.org 
> Subject: Towards DASL fulltextsearch query
>
> The DASL simplesearch grammar, while supporting SQL fairly well, does a
> poor job of supporting full text search engines such as Verity, WAIS,
> SMART, or MG.
>
> For some such engines, a query is a document, or at least a lengthy
> portion of text, rather than a set of expressions on fields joined by
> Boolean values.
>
> For others the query is a small set of words, and the query may specify
> the maximum allowable distance between words in the target documents
> (e.g. within N words, in the same sentence, or in the same paragraph).
> The result is a set of documents ordered according to similarity to the
> query. Typically there is a cutoff in the number of documents returned,
> but in principle the similarity is computed for every document in the
> corpus.
>
> Usually a numeric score is returned for each document.
>
> Many of these systems also allow the client to specify choice of token
> processing (e.g. stemming), the matching rules (soundex, left or right
> truncation), and/or to influence the ranking by providing weights on
> terms used in the search.
>
> None of these are well supported in the DASL simplesearch grammar, and I
> don't think they should be.
>
> For one thing, there is no common practise to standardize on for queries
> that work on both boolean and full text engines. (STARTS is the best
> attempt so far.)
>
> Even if we succeeded in defining it, the result would not be a *simple*
> search grammar, and I think the likely outcome would be that typical
> implementations of DASL simplesearch would either support the boolean
> side well, or the fulltext side well, but not both. So in practice, a
> client would do query schema discovery to find out which kind worked,
> and once it does that, there's no real difference between doing QSD on
> one grammar, and grammar discovery (via OPTIONS) on the arbiter itself.
> In other words, rather than make a complicated simplesearch that can
> express both kinds of search, leave simplesearch alone, and define a
> fulltextsearch.
>
> This is not to say that there should be NO content search at all in
> DASL, to the contrary, there should, but it should be quite limited.
>
> It's really a call to begin thinking about defining a second grammar,
> which may or may not make it into the first DASL specification.

Received on Monday, 27 July 1998 12:38:05 UTC