Re: The minimum set from Rick Henderson on 1998-04-17 (www-webdav-dasl@w3.org from April to June 1998)

From: Rick Henderson <rickh@netscape.com>
Date: Fri, 17 Apr 1998 12:41:10 -0700
To: "Saveen Reddy (Exchange)" <saveenr@Exchange.Microsoft.com>
CC: "'www-webdav-dasl@w3.org'" <www-webdav-dasl@w3.org>
Message-ID: <3537B055.F95BB0DA@netscape.com>
Saveen Reddy (Exchange) wrote:

> During the DASL BOF two weeks ago we often ended up saying we needed to
> define the minimum set of useful operations that a server must support.
>
> I want to get a feeling of what people on the list are thinking here. To
> spur this on I'll deliberately take the extreme end of "minimum" and say
> that it means an operator for equality and a simple CONTAINS (with some
> variability for case-sensitivity and language).
> The scenario addressed by this minimum set above is only simple authoring. (I
> realize this doesn't even meet the very first scenario I had in my slides in
> LA, but I wanted to get the discussion going.)

There are a lot of things that might go under CONTAINS, so I'm not sure this
is a minimal proposal.  If the operand (is there only one operand?) to
CONTAINS is a phrase, must it support phrase search?

In the same spirit I'll list some operations that I think we should have in
the minimal set.  Then I'll add some arguments for this choice.

CONTAINS CASE - matches one word in a case insensitive manner.

CONTAINS STEM - matches one word that is a form of the given word, case
insensitive.

CONTAINS PHRASE CASE - matches multiple words in the given order
without intervening words in a case insensitive manner.  This would NOT be
sensitive to the spaces or punctuation between words.

CONTAINS PHRASE STEM - matches multiple words in the given order
without intervening words that are a form of the given words, in a case
insensitive manner.

SIMILARITY - matches the top N documents according to the server's best
algorithm.  We don't specify in detail what criteria the server
can use to select the documents.  Proximity, word frequency, inverse
document frequency, thesaurus lookup, grammatical analysis, and
soundex are all possibilities.  Similarity would take an extra parameter
that limits the number of documents returned.
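
As a rough illustration of SIMILARITY with its extra limit parameter, here
is a minimal sketch that ranks documents by a simple word frequency and
inverse document frequency score.  The function name, the scoring formula,
and the dict-of-strings document store are all assumptions made for the
example; a real server would be free to use any of the criteria listed above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def similarity(query, docs, limit):
    """Return up to `limit` (doc_id, score) pairs, best match first.

    `docs` maps a document id to its text.  The scoring is a hypothetical
    stand-in for "the server's best algorithm".
    """
    doc_tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(doc_tokens)

    # Document frequency: how many documents contain each word.
    df = Counter()
    for tokens in doc_tokens.values():
        df.update(set(tokens))

    query_words = set(tokenize(query))
    scores = {}
    for doc_id, tokens in doc_tokens.items():
        tf = Counter(tokens)
        score = 0.0
        for word in query_words:
            if tf[word]:
                idf = math.log(1 + n_docs / df[word])
                score += (tf[word] / len(tokens)) * idf
        if score > 0:
            scores[doc_id] = score

    # Highest score first; ties broken by document id for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]
```

The caller only supplies the query and the limit; everything about how the
ranking is computed stays on the server side.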


All these operations must have some indication of language, or a language
implied somehow.  We can't specify in detail exactly what is defined as a
form of the same word.  That can be left to the implementation.
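
The four CONTAINS operations listed above could behave roughly as in the
sketch below.  The function names are made up for the example, and
toy_stem is a deliberately crude, English-only stand-in, since what counts
as a form of the same word is left to the implementation.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into words.

    Spaces and punctuation between words are discarded, so "Blood-Bank"
    and "blood  bank" produce the same tokens.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def toy_stem(word):
    """Hypothetical stemmer: strips a few common English suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def _contains_run(doc_tokens, query_tokens):
    """True if query_tokens appear in doc_tokens in order with no gaps."""
    n = len(query_tokens)
    if n == 0:
        return False
    return any(doc_tokens[i:i + n] == query_tokens
               for i in range(len(doc_tokens) - n + 1))


def contains_case(doc, word):
    """CONTAINS CASE: one word, case insensitive."""
    return word.lower() in tokenize(doc)


def contains_stem(doc, word):
    """CONTAINS STEM: one word, any form of it, case insensitive."""
    return toy_stem(word.lower()) in [toy_stem(t) for t in tokenize(doc)]


def contains_phrase_case(doc, phrase):
    """CONTAINS PHRASE CASE: adjacent words in order, case insensitive."""
    return _contains_run(tokenize(doc), tokenize(phrase))


def contains_phrase_stem(doc, phrase):
    """CONTAINS PHRASE STEM: adjacent words in order, matched by stem."""
    return _contains_run([toy_stem(t) for t in tokenize(doc)],
                         [toy_stem(t) for t in tokenize(phrase)])
```

A language indication would, in this sketch, amount to choosing a different
tokenizer and stemmer per language.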

Why we need these:

Without stemming too much is missed in many cases.  Stemming is needed
to broaden a search that wouldn't find enough relevant documents otherwise.

With that said we can't require stemming all the time.  Sometimes the user
is going for a particular form or a particular phrase and in these cases
stemming just adds extra unwanted documents.

Phrase is needed because so many ideas are expressed as a phrase of two
otherwise common words.  e.g. "blood bank", "mind set", "United States".
Using AND as a substitute for PHRASE will bring in a lot of unwanted stuff.
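
To make the AND versus PHRASE difference concrete, here is a small sketch
with two invented sample documents.  The helper names match_and and
match_phrase are assumptions for the example only.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_and(doc, words):
    """True if every word occurs somewhere in the document."""
    tokens = set(tokenize(doc))
    return all(w.lower() in tokens for w in words)


def match_phrase(doc, phrase):
    """True if the words of the phrase occur adjacent and in order."""
    tokens, wanted = tokenize(doc), tokenize(phrase)
    n = len(wanted)
    return n > 0 and any(tokens[i:i + n] == wanted
                         for i in range(len(tokens) - n + 1))


# Invented sample documents.
docs = [
    "The blood bank opens at nine.",
    "Blood was found on the river bank.",
]

and_hits = [d for d in docs if match_and(d, ["blood", "bank"])]
phrase_hits = [d for d in docs if match_phrase(d, "blood bank")]
```

Here and_hits holds both documents, while phrase_hits holds only the first.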

Similarity is needed to address the problem that the very limited search
capabilities defined otherwise tend to bring in too few or too many
documents.  It also allows a lot more power without adding a large number
of operations to the minimum set.

Why we can live without more in the minimum set:

This proposal would leave out of the minimum set such valuable operations
as exact case sensitive match, proximity (word, sentence, paragraph), wild
card, soundex, and thesaurus.

The value of proximity, soundex, and wild card is mostly to approximate the
ideal of a concept search.  The user is thinking of something and can
construct complex queries to try and tease out the documents that match
their concept but don't bring in a lot of other documents.  This is
addressed with less specification and potentially more powerfully by
Similarity.

Exact case sensitive match isn't that much of a gain over case insensitive
match.  It's a nice feature but noise brought in by case insensitivity is
not that great.  In the interest of minimalism it should be left out.  Nor
is case sensitivity cheap.  It requires either a larger (slower) index or a
post search verification step (very slow).
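
The post search verification option can be sketched as follows, assuming a
toy in-memory index keyed on lowercased words.  The names build_index and
search_exact_case are hypothetical; the point is only that every candidate
document has to be re-read to confirm the exact case.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[A-Za-z0-9]+", text):
            index[word.lower()].add(doc_id)
    return index


def search_exact_case(word, docs, index):
    """Case sensitive match on top of a case insensitive index."""
    # Step 1: cheap lookup in the case insensitive index.
    candidates = index.get(word.lower(), set())
    # Step 2: verification, re-scanning the full text of each candidate.
    return sorted(doc_id for doc_id in candidates
                  if word in re.findall(r"[A-Za-z0-9]+", docs[doc_id]))
```

The alternative, a larger index, would store each word in its original case
as well, trading index size for the second pass.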

Thesaurus lookup opens the doors of the search very wide and as such is
quite similar to Similarity without the limitation on getting everything.
I don't think there is a strong need to control Thesaurus operations
separately from other useful steps for broadening a search such as soundex.

--Rick
*************************************************
Rick Henderson            (Netscape)(650)937-3152
rickh@netscape.com
*************************************************
Received on Friday, 17 April 1998 15:48:12 UTC