Re: Why is STOPWORDs a MUST in XQuery/XPath Full-Text and not MAY?

On Tue, Feb 18, 2003 at 05:34:07PM -0500, Todd A. Mancini wrote:
> I don't believe the majority of users want stopwords

Tim Bray used to say, on comp.text I think, that
"stop words are a bug, not a feature".

My own text retrieval system implemented them to save disk space,
but recorded with each posting the fact that one or more stopwords
had been skipped, to improve accuracy.  (I won't say "precision",
as that metric doesn't necessarily increase with more accurate
retrieval!)
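
As a rough illustration of the idea above (not the actual lq-text
posting format), here is a minimal Python sketch. The stop list, the
function name build_index and the posting layout are all assumptions
made up for this example. Stop words are not stored, but each posting
carries a flag saying that one or more stop words were skipped just
before it.

```python
from collections import defaultdict

# Hypothetical stop list, for illustration only.
STOPWORDS = {"a", "an", "and", "if", "of", "the", "to", "what"}

def build_index(docs):
    """Map each indexed word to postings (doc_id, position, skipped_before)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        skipped = 0  # stop words dropped since the last indexed word
        for position, word in enumerate(text.lower().split()):
            if word in STOPWORDS:
                skipped += 1  # not stored, which is what saves the disk space
                continue
            # Record that stop words were skipped immediately before this word.
            index[word].append((doc_id, position, skipped > 0))
            skipped = 0
    return index

index = build_index({1: "the cat sat", 2: "cat and the dog"})
```

With a flag like this, a phrase query for "cat dog" can tell that
something was skipped between the two words in document 2, rather
than treating them as adjacent.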

In some areas of research, it's common to use a restricted vocabulary
for searching, and stop words may make sense there, especially if
they can be applied selectively, e.g. on a per-document or
per-repository basis.

When I spoke about lq-text at Usenix [1], someone in the audience
gave an anecdote about IBM's "STAIRS", the first published text
retrieval system.  It was used to help support a legal case, but
during the trial it turned out that one of the parties involved had
a name like "What if, Inc", made of stopwords, and they were unable
to search for it.

Some of the older information retrieval systems may have severe
performance degradations without stop words, unfortunately, so
it's probably wise to allow them.  Requiring them is another matter.

Try a search for "to be or not to be", and see how many systems
(1) take it as a boolean search on "OR NOT" and return all documents
(2) ignore 2-letter words and common words and return nothing
(3) fail because this doesn't exactly occur in Shakespeare -- there's
    a comma and an upper-case letter.
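
Failure modes (2) and (3) above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the
stop list, the three-letter minimum and the function names are
hypothetical and not taken from any particular system.

```python
import re

# Hypothetical stop list, for illustration only.
STOPWORDS = {"to", "be", "or", "not"}

def normalise(text):
    """Lower-case and drop punctuation so "To be, or not to be" can match."""
    return re.findall(r"[a-z0-9]+", text.lower())

def query_terms(query, min_length=3):
    """Terms left after a naive filter drops stop words and short words."""
    return [w for w in normalise(query)
            if len(w) >= min_length and w not in STOPWORDS]

def phrase_match(query, document):
    """Phrase match on normalised tokens, with no stop-word filtering."""
    q, d = normalise(query), normalise(document)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

remaining = query_terms("to be or not to be")
found = phrase_match("to be or not to be",
                     "To be, or not to be, that is the question")
```

Here remaining is empty, so a system that filters the query this way
has nothing left to search for, while found is true because the
comma and the capital letter are normalised away before comparison.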

Liam

[1] http://www.holoweb.net/~liam/papers/1994-usenix-boston-textretrieval/

-- 
Liam Quin, W3C XML Activity Lead, liam@w3.org, http://www.w3.org/People/Quin/
http://www.holoweb.net/~liam/
Ankh's list of IRC clients: http://www.valinor.sorcery.net/clients/

Received on Wednesday, 19 February 2003 12:13:24 UTC