
[FT] FTStopWordOption semantics insufficient?

From: Paul J. Lucas <paul@lucasmail.org>
Date: Thu, 3 Jun 2010 09:49:45 -0700
Message-Id: <6ADC0EA9-C1EE-462A-B6B3-860E59371A54@lucasmail.org>
To: public-qt-comments@w3.org

Section 4.2.5.8 of the Full Text spec says in part:

> The stop words set is computed using the fts:calcStopWords function. The function uses the function fts:resolveStopWordsUri to resolve any URI to a sequence of strings. Then, the stop words are removed from the set of query tokens.

It seems insufficient simply to remove the stop words from the set of query tokens and keep the rest of the semantics the same.

Section 8.2.1 of the Full Text Use Cases says in part:

> Once the stop word "then" has been identified via the stop word list ... [the query "planning then conducting"] is reduced to a query on the phrase "planning" any word "conducting", allowing any word as a substitute for the stop word.

If stop words were simply removed per section 4.2.5.8, the query would be reduced to the phrase "planning conducting" which is not the same thing.
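
To make the difference concrete, here is a minimal Python sketch of the two reductions. It is only an illustration, not the spec's formal semantics: the function names, the `ANY` placeholder, and the one-word stop word list are all assumptions made for this example.

```python
# Assumed stop word list for this illustration only.
STOP_WORDS = {"then"}

# Hypothetical placeholder meaning "any single word may appear here".
ANY = None

def remove_stop_words(tokens, stop_words):
    """Reduction as section 4.2.5.8 reads literally: drop the stop words."""
    return [t for t in tokens if t not in stop_words]

def replace_stop_words(tokens, stop_words):
    """Reduction as use case 8.2.1 describes it: keep the position, allow any word."""
    return [ANY if t in stop_words else t for t in tokens]

query = "planning then conducting".split()
removed = remove_stop_words(query, STOP_WORDS)    # two-token phrase
replaced = replace_stop_words(query, STOP_WORDS)  # three-token phrase with a gap
```

The first result is a two-word phrase with nothing between the words; the second keeps three positions, with the middle one open to any word.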

It would seem that any query of the form:

	"W S W" using stop words

(where W is a non-stop word and S is a stop word) would be mostly equivalent to:

	"W .* W" using wildcards

except that any other punctuation would not be treated as wildcard syntax (unless the original query also has the "using wildcards" option).

Either that, or the concept of a special "match-anything token" has to be introduced and the implementation of matchTokenInfos() has to take it into account.
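
As a rough sketch of that second option, the following Python shows how a phrase matcher could honor such a token. It is not the spec's matchTokenInfos(); `AnyToken`, `token_matches`, and `find_phrase` are hypothetical names, and the matcher works on plain word lists rather than TokenInfos.

```python
class AnyToken:
    """Hypothetical sentinel standing in for a removed stop word."""
    def __repr__(self):
        return "<any>"

ANY_TOKEN = AnyToken()

def token_matches(query_token, doc_token):
    # The match-anything token accepts every document token;
    # ordinary tokens must be equal.
    return query_token is ANY_TOKEN or query_token == doc_token

def find_phrase(doc_tokens, query_tokens):
    """Return the start positions where the query phrase matches contiguously."""
    n = len(query_tokens)
    return [
        i
        for i in range(len(doc_tokens) - n + 1)
        if all(token_matches(q, d)
               for q, d in zip(query_tokens, doc_tokens[i:i + n]))
    ]

doc = "planning and conducting usability tests".split()
with_gap = find_phrase(doc, ["planning", ANY_TOKEN, "conducting"])
without_gap = find_phrase(doc, ["planning", "conducting"])
```

Here the phrase with the placeholder matches at position 0, while the phrase with the stop word simply removed finds no match in the same text, because "and" sits between the two words.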

Comments?

- Paul
Received on Thursday, 3 June 2010 16:50:22 UTC
