- From: Babich, Alan <ABabich@filenet.com>
- Date: Fri, 24 Jul 1998 13:11:47 -0700
- To: "'Jim Davis'" <jdavis@parc.xerox.com>, www-webdav-dasl@w3.org
I assume that "contains" is going to be explicitly called out as a "pass through" operator for DASL 1.0. Then, "contains" would support any of the engines you mentioned to the extent that a string can be mapped into their API. My guess is that for those systems that can take a document as a query, this is a particularly straightforward mapping. Some engines can directly accept a string with embedded syntax that can specify stemming, proximity, case sensitivity, soundex, and Boolean combinations of conditions. Of course, "contains" wouldn't be interoperable, because we haven't provided a way to advertise what the string syntax and semantics are.

Having had the advantage of working on this problem in other contexts, I think that what the "second grammar" would consist of would be simplesearch version 2. That would be the same as simplesearch version 1 with additional optional content search operators, some score type properties (STARTS has normalized and raw score properties, for example), and some constraints advertised in the QSD on the overall form of the where element. We could also optionally define one or more hit highlighting property formats, but that's a bigger rathole than the rest of it combined.

The existing AND, OR, and NOT operators could be used to advantage for Boolean combinations by some collections, and some degree of AND and OR can be embedded in the operator's operands as well. (For example, STARTS has an operator with a list of words as one operand, and another operand that says whether all or some of the words must be in the document. This is essentially equivalent to AND or OR.)

We could define any number of generalized and/or engine specific full text search operators, and accommodate any number of engines. The QSD would, of course, advertise what operators are available for a particular collection. The only significant downside to inventing optional operators is the time it takes.
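
To make the point about the word-list operator concrete, here is a rough Python sketch. It is not STARTS or DASL syntax: the names `WordList`, `Contains`, `And`, `Or`, `expand`, and `matches` are all hypothetical, and matching is reduced to naive whitespace tokenization. It only illustrates that an operator with a word list and an "all or some" flag expands into plain AND or OR over single-word conditions.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Contains:
    """Hypothetical single-word content condition."""
    word: str


@dataclass
class And:
    operands: List["Expr"]


@dataclass
class Or:
    operands: List["Expr"]


@dataclass
class WordList:
    """Hypothetical operator: a list of words plus an all/some flag."""
    words: List[str]
    require_all: bool


Expr = Union[Contains, And, Or, WordList]


def expand(op: WordList) -> Expr:
    """Rewrite the word-list operator as AND or OR of single-word conditions."""
    terms: List[Expr] = [Contains(w) for w in op.words]
    return And(terms) if op.require_all else Or(terms)


def matches(expr: Expr, document: str) -> bool:
    """Evaluate an expression against a document (naive whitespace tokens)."""
    tokens = set(document.lower().split())
    if isinstance(expr, Contains):
        return expr.word.lower() in tokens
    if isinstance(expr, And):
        return all(matches(e, document) for e in expr.operands)
    if isinstance(expr, Or):
        return any(matches(e, document) for e in expr.operands)
    if isinstance(expr, WordList):
        hits = [w.lower() in tokens for w in expr.words]
        return all(hits) if expr.require_all else any(hits)
    raise TypeError(f"unsupported expression: {expr!r}")
```

For any document, `matches(op, doc)` and `matches(expand(op), doc)` agree, so a collection that only understands AND and OR could still accept the word-list form after a simple rewrite. Stemming, proximity, and scoring are deliberately left out of this sketch.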
Of course, the only operators really useful for interoperability would be the generalized ones, and these would capture only least common denominator functionality. We should start there. In fact, we should probably start with STARTS. I'm not saying when we should start on this.

Alan Babich

> -----Original Message-----
> From: Jim Davis [mailto:jdavis@parc.xerox.com]
> Sent: July 24, 1998 11:14 AM
> To: www-webdav-dasl@w3.org
> Subject: Towards DASL fulltextsearch query
>
> The DASL simplesearch grammar, while supporting SQL fairly well, does a
> poor job of supporting full text search engines such as Verity, WAIS,
> SMART, or MG.
>
> For some such engines, a query is a document, or at least a lengthy
> portion of text, rather than a set of expressions on fields joined by
> Boolean values. For others the query is a small set of words, and the
> query may specify the maximum allowable distance between words in the
> target documents (e.g. within N words, in the same sentence, or in the
> same paragraph). The result is a set of documents ordered according to
> similarity to the query. Typically there is a cutoff in the number of
> documents returned, but in principle the similarity is computed for
> every document in the corpus. Usually a numeric score is returned for
> each document.
>
> Many of these systems also allow the client to specify choice of token
> processing (e.g. stemming), the matching rules (soundex, left or right
> truncation), and/or to influence the ranking by providing weights on
> terms used in the search.
>
> None of these are well supported in the DASL simplesearch grammar, and
> I don't think they should be.
>
> For one thing, there is no common practise to standardize on for
> queries that work on both boolean and full text engines. (STARTS is the
> best attempt so far.)
>
> Even if we succeeded in defining it, the result would not be a *simple*
> search grammar, and I think the likely outcome would be that typical
> implementations of DASL simplesearch would either support the boolean
> side well, or the fulltext side well, but not both. So in practice, a
> client would do query schema discovery to find out which kind worked,
> and once it does that, there's no real difference between doing QSD on
> one grammar, and grammar discovery (via OPTIONS) on the arbiter itself.
> In other words, rather than make a complicated simplesearch that can
> express both kinds of search, leave simplesearch alone, and define a
> fulltextsearch.
>
> This is not to say that there should be NO content search at all in
> DASL, to the contrary, there should, but it should be quite limited.
>
> It's really a call to begin thinking about defining a second grammar,
> which may or may not make it into the first DASL specification.
Received on Friday, 24 July 1998 16:14:47 UTC