Why are stop words a MUST in XQuery/XPath Full-Text and not a MAY?

In the current 'XQuery and XPath Full-Text Requirements' draft, section
6.1 Functionality, stopwords are labeled as MUST.  I propose that this
be changed to MAY.

Stop words arose mainly because older-generation full-text engines,
with their inefficient indexing algorithms, could not handle terms that
were extremely common in the source text.  Stop words were typically NOT
a feature -- they were a band-aid.  It seems unnatural to say that an
engine which does not have stop words is less powerful than an engine
with stop words, assuming both perform acceptably.

Consider a phrase such as "to be or not to be."  If a student is
researching the works of Shakespeare, would that student consider a
full-text engine more powerful if that engine labeled all of these terms
as stop words, and therefore reduced the query to nothing, or, as some
of the use-cases imply, a search for ANY 6-word phrase?
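
The two outcomes described above can be sketched in a few lines of
Python.  This is only an illustration, not taken from the draft or from
any particular engine: the stop list and both function names are
hypothetical, and the list is assumed to contain every word of the
phrase.

```python
# Hypothetical stop list; real engines ship their own, usually longer, lists.
STOP_WORDS = {"to", "be", "or", "not", "the", "a", "of"}


def tokenize(query):
    """Lowercase the query, drop periods, and split on whitespace."""
    return query.lower().replace(".", "").split()


def strip_stop_words(query):
    """Outcome 1: stop words are dropped, so only the other terms remain."""
    return [term for term in tokenize(query) if term not in STOP_WORDS]


def to_wildcard_phrase(query):
    """Outcome 2: each stop word becomes a wildcard that matches any word."""
    return ["*" if term in STOP_WORDS else term for term in tokenize(query)]


phrase = "to be or not to be."
print(strip_stop_words(phrase))    # no terms survive
print(to_wildcard_phrase(phrase))  # six wildcards: any 6-word phrase matches
```

Under these assumptions the first function returns an empty list and the
second returns six wildcards, so neither query can single out the
original phrase.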

If someone can develop a full-text search engine without stop words
which performs on par with, or faster than, another engine which has
stop words turned on, should that first vendor be required to add stop
word support?

As the former Director of Professional Services for AltaVista Software,
I speak from experience.  My customers routinely replaced existing
engines which mandated stop words with the AltaVista engine, which does
not natively support stop words nor require such support to perform.
Searching an index of 10 million documents for the word 'the' is
possible with extremely modest hardware.  A great percentage of
real-world scenarios involving Full-Text will likely involve fewer than
millions of XML nodes -- not having stop word support would be
completely appropriate in the vast majority of uses of the technology.
Therefore, I argue that MAY is the more appropriate requirement level
for stopword support.  [Disclaimer: I am no longer affiliated
with the AltaVista Company, nor am I selling their software.  This is
not an advertisement for the AltaVista software, but rather a real-world
example showing the value of NOT mandating stopword support.]

Thank you for your consideration.

	-Todd Mancini

Received on Monday, 17 February 2003 11:06:01 UTC