
[Bug 12109] [FT] StopWord Option

From: <bugzilla@jessica.w3.org>
Date: Thu, 22 Sep 2011 09:04:48 +0000
To: public-qt-comments@w3.org
Message-Id: <E1R6fD6-0007jv-1z@jessica.w3.org>
http://www.w3.org/Bugs/Public/show_bug.cgi?id=12109

--- Comment #3 from Tim Mills <tim@cbcl.co.uk> 2011-09-22 09:04:46 UTC ---
By way of further explanation, the following is motivated by the conviction
that XQuery Full Text should be able to be implemented using a traditional full
text inverted index.

The examples given below have the form

$arg contains text "STOPWORD" using stop words ("STOPWORD") 

but can be interpreted as:

count(fts:tokenize($arg)) >= 1

where fts:tokenize is an implementation-defined function which returns a
sequence of element(TokenInfo) resulting from tokenization of its arguments.
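
As a rough illustration of that equivalence, here is a minimal Python sketch.
The tokenize function is a hypothetical stand-in for fts:tokenize (which is
implementation-defined), and the "stop word matches any token" behaviour is
modelled directly; none of these names come from the specification.

```python
import re

def tokenize(text):
    """Hypothetical stand-in for fts:tokenize: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())

def contains_single_stop_word(text):
    """Models: $arg contains text "STOPWORD" using stop words ("STOPWORD").

    Assumption: a stop word in the query matches any token in the text,
    so the stop word itself never needs to be compared with anything.
    """
    for _token in tokenize(text):
        return True   # the first token of the text already satisfies the match
    return False      # no tokens at all, so nothing to match against

def at_least_one_token(text):
    """Models: count(fts:tokenize($arg)) >= 1."""
    return len(tokenize(text)) >= 1
```

Under these assumptions the two functions agree on every input, which is the
sense in which the query says nothing about the stop word itself.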

This illustrates how the stop word feature can be abused to do things which
have little to do with stop word handling.  I'd argue this is a Bad Thing.  It
results from the specification saying that "Stop words are tokens in the
query that match any token in the text being searched."  This is quite
different from how stop words are generally used in Information Retrieval,
where they are typically discarded (either at query time or index time).


EXAMPLE 1
---------

It is implementation-defined whether the following expression will return true
or false.

"not" contains text "and" using stop words ("and", "not") 

EXPLANATION
-----------

Tokenization of "not" using the tokenization rules used for examples will
result in a single token ("not").

The expression will return true if the implementation matches the stop word
'and' against the single token "not".  (Performing such a match with an
inverted index is inefficient.)

The expression will return false if the implementation has discarded stop words
during indexing (as permitted by the note in section 3.4.7).
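
The two outcomes can be sketched in Python as follows.  Both strategies, the
tokenize helper and the function names are assumptions made for illustration;
they are not taken from the specification or from any particular
implementation.

```python
import re

STOP_WORDS = {"and", "not"}

def tokenize(text):
    """Hypothetical tokenizer: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())

def match_any_token(text, query_word, stop_words):
    """Strategy A: a query stop word matches any token in the text."""
    tokens = tokenize(text)
    if query_word in stop_words:
        return len(tokens) >= 1
    return query_word in tokens

def match_with_index_time_discard(text, query_word, stop_words):
    """Strategy B: stop words are dropped when the text is indexed."""
    indexed = {t for t in tokenize(text) if t not in stop_words}
    if query_word in stop_words:
        # The query stop word can only match tokens that survived indexing.
        return len(indexed) >= 1
    return query_word in indexed

# "not" contains text "and" using stop words ("and", "not")
print(match_any_token("not", "and", STOP_WORDS))                # True
print(match_with_index_time_discard("not", "and", STOP_WORDS))  # False
```

The same query over the same text gives different answers depending only on
when the stop words are applied.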


EXAMPLE 2
---------

I can't find an argument in the specification to support returning false for
the following query.

"strange" contains text "and" using stop words ("and") 

However, I believe this to be a problem with the specification.  It is quite
normal for an IR system to return false for such a query.

EXPLANATION
-----------

Tokenization of "strange" using the tokenization rules used for examples will
result in a single token ("strange").

The expression will return true if the implementation matches the stop word
'and' against the single token "strange".
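
For comparison, here is a minimal Python sketch of the traditional IR
behaviour: an inverted index with stop words discarded at query time.  The
function names and the convention that an empty query matches nothing are
assumptions for illustration only.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Hypothetical tokenizer: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())

def build_inverted_index(docs):
    """Map each token to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query, stop_words):
    """Conjunctive lookup after discarding stop words from the query."""
    terms = [t for t in tokenize(query) if t not in stop_words]
    if not terms:
        return set()  # assumption: a query with no terms left matches nothing
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

def docs_with_any_token(index):
    """What "match any token" needs: the union of every postings list."""
    return set().union(*index.values())

index = build_inverted_index({1: "strange"})
print(search(index, "and", {"and"}))   # set()
print(docs_with_any_token(index))      # {1}
```

The search function touches only the postings lists of the remaining query
terms, whereas docs_with_any_token has to visit every postings list in the
index, which is why matching a stop word against any token is a poor fit for
an inverted index.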

Received on Thursday, 22 September 2011 09:04:53 GMT