- From: Rick Henderson <rickh@netscape.com>
- Date: Fri, 17 Apr 1998 12:41:10 -0700
- To: "Saveen Reddy (Exchange)" <saveenr@Exchange.Microsoft.com>
- CC: "'www-webdav-dasl@w3.org'" <www-webdav-dasl@w3.org>
Saveen Reddy (Exchange) wrote:

> During the DASL BOF two weeks ago we often ended up saying we needed to
> define the minimum set of useful operations that a server must support.
>
> I want to get a feeling of what people on the list are thinking here. To
> spur this on I'll deliberately take the extreme end of "minimum" and say
> that it means an operator for equality and a simple CONTAINS (with some
> variability for case-sensitivity and language).
> The scenario addressed by this minimum set above is only simple authoring. (I
> realize this doesn't even meet the very first scenario I had in my slides in
> LA, but I wanted to get the discussion going.)

There are a lot of things that might go under CONTAINS, so I'm not sure this is a minimal proposal. If the operand (is there only one operand?) to CONTAINS is a phrase, must it support phrase search?

In the same spirit I'll list some operations that I think we should have in the minimal set. Then I'll add some arguments for this choice.

CONTAINS CASE - matches one word in a case insensitive manner.

CONTAINS STEM - matches one word that is a form of the given word, case insensitive.

CONTAINS PHRASE CASE - matches multiple words in the given order without intervening words, in a case insensitive manner. This would NOT be sensitive to the spaces or punctuation between words.

CONTAINS PHRASE STEM - matches multiple words, each a form of the corresponding given word, in the given order without intervening words, in a case insensitive manner.

SIMILARITY - returns the top so many documents, as ranked by the server's best algorithm. We don't specify in detail what criteria the server can use to select the documents. Proximity, word frequency, inverse document frequency, thesaurus lookup, grammatical analysis, and soundex are all possibilities. Similarity would take an extra parameter that limits the number of documents returned.

All these operations must have some indication of language, or a language implied somehow.
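To make the proposed operators concrete, here is a rough Python sketch of how a server might evaluate them against plain text. Everything in it is an assumption made for illustration: the function names, the tokenizer, the crude English-only suffix stripper standing in for a real stemmer, and the term-count ranking used for SIMILARITY. None of it is proposed syntax or a required algorithm.

```python
import re

def tokenize(text):
    """Lowercase the text and split on anything that is not a letter or digit,
    so spaces and punctuation between words are ignored."""
    return re.findall(r"[a-z0-9]+", text.lower())

def toy_stem(word):
    """Crude suffix stripping; a placeholder for a real, language-aware stemmer."""
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def _contains_sequence(haystack, needle):
    """True if `needle` occurs in `haystack` as a contiguous run of tokens."""
    n = len(needle)
    return n > 0 and any(haystack[i:i + n] == needle
                         for i in range(len(haystack) - n + 1))

def contains_case(doc, word):
    """CONTAINS CASE: one word, case insensitive, exact form."""
    return word.lower() in tokenize(doc)

def contains_stem(doc, word):
    """CONTAINS STEM: one word, any form of it, case insensitive."""
    return toy_stem(word.lower()) in {toy_stem(t) for t in tokenize(doc)}

def contains_phrase_case(doc, phrase):
    """CONTAINS PHRASE CASE: words in order, no intervening words."""
    return _contains_sequence(tokenize(doc), tokenize(phrase))

def contains_phrase_stem(doc, phrase):
    """CONTAINS PHRASE STEM: as above, but each word may be any form."""
    return _contains_sequence([toy_stem(t) for t in tokenize(doc)],
                              [toy_stem(t) for t in tokenize(phrase)])

def similarity(docs, query, limit):
    """SIMILARITY: rank documents by how many tokens match a stemmed query
    term (a stand-in for the server's own algorithm) and return the top `limit`."""
    terms = {toy_stem(t) for t in tokenize(query)}
    scored = []
    for doc in docs:
        score = sum(1 for t in tokenize(doc) if toy_stem(t) in terms)
        if score > 0:
            scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:limit]]
```

For example, `contains_case("Banks are open", "bank")` is false while `contains_stem("Banks are open", "bank")` is true, which is the distinction between the CASE and STEM variants.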
We can't specify in detail exactly what is defined as a form of the same word. That can be left to the implementation.

Why we need these:

Without stemming, too much is missed in many cases. Stemming is needed to broaden a search that wouldn't find enough relevant documents otherwise. With that said, we can't require stemming all the time. Sometimes the user is going for a particular form or a particular phrase, and in these cases stemming just adds extra unwanted documents.

Phrase is needed because so many ideas are expressed as a phrase of two otherwise common words, e.g. "blood bank", "mind set", "United States". Using AND as a substitute for PHRASE will bring in a lot of unwanted stuff.

Similarity is needed to address the problem that the very limited search capabilities defined otherwise tend to bring in too few or too many documents. It also allows a lot more power without adding a large number of operations to the minimum set.

Why we can live without more in the minimum set:

This proposal would leave out of the minimum set such valuable operations as exact case sensitive match, proximity (word, sentence, paragraph), wild card, soundex, and thesaurus.

The value of proximity, soundex, and wild card is mostly to approximate the ideal of a concept search. The user is thinking of something and can construct complex queries to try to tease out the documents that match their concept without bringing in a lot of other documents. This is addressed with less specification, and potentially more powerfully, by Similarity.

Exact case sensitive match isn't that much of a gain over case insensitive match. It's a nice feature, but the noise brought in by case insensitivity is not that great. In the interest of minimalism it should be left out. Nor is case sensitivity cheap. It requires either a larger (slower) index or a post search verification step (very slow).
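The argument above that AND is a poor substitute for PHRASE can be seen in a small Python sketch. The two sample documents and the helper names (`words`, `matches_and`, `matches_phrase`) are made up for illustration and are not part of any proposal.

```python
import re

def words(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches_and(doc, query):
    """AND semantics: every query word appears somewhere in the document."""
    doc_words = set(words(doc))
    return all(w in doc_words for w in words(query))

def matches_phrase(doc, query):
    """PHRASE semantics: the query words appear in order with nothing between."""
    d, q = words(doc), words(query)
    return bool(q) and any(d[i:i + len(q)] == q
                           for i in range(len(d) - len(q) + 1))

# Hypothetical documents: only the first is actually about a blood bank.
docs = [
    "The blood bank opens at nine.",
    "He walked along the river bank with blood on his sleeve.",
]

and_hits = [d for d in docs if matches_and(d, "blood bank")]
phrase_hits = [d for d in docs if matches_phrase(d, "blood bank")]
```

Here the AND query matches both documents, while the phrase query matches only the first.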
Thesaurus lookup opens the doors of the search very wide, and as such is quite similar to Similarity but without the limit on how much is returned. I don't think there is a strong need to control Thesaurus operations separately from other useful steps for broadening a search, such as soundex.

--Rick

*************************************************
Rick Henderson (Netscape) (650)937-3152 rickh@netscape.com
*************************************************
Received on Friday, 17 April 1998 15:48:12 UTC