[Bug 12109] New: [FT] StopWord Option

From: <bugzilla@jessica.w3.org>
Date: Thu, 17 Feb 2011 14:27:01 +0000
To: public-qt-comments@w3.org
Message-ID: <bug-12109-523@http.www.w3.org/Bugs/Public/>
http://www.w3.org/Bugs/Public/show_bug.cgi?id=12109

           Summary: [FT] StopWord Option
           Product: XPath / XQuery / XSLT
           Version: Proposed Recommendation
          Platform: PC
        OS/Version: Windows NT
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Full Text 1.0
        AssignedTo: jim.melton@acm.org
        ReportedBy: tim@cbcl.co.uk
         QAContact: public-qt-comments@w3.org


There is very little information given regarding how stop words work except as
part of phrases or in the context of FTWindow/FTDistance.

In information retrieval systems based upon inverted indices, it is traditional
to use stop words to remove high frequency terms from the index to reduce the
size of the inverted index.  It is also traditional to ignore stop words during
query processing to improve query performance (both speed and precision).
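
As a rough illustration of that traditional approach, the following Python
sketch builds a toy inverted index and drops stop words at index time. The
whitespace tokenizer, the function name build_index, and the sample documents
are assumptions made for illustration only; nothing here is defined by the
specification.

```python
from collections import defaultdict


def build_index(docs, stop_words=frozenset()):
    """Map each token that is not a stop word to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenization (assumption): lower-case, split on whitespace.
        for token in text.lower().split():
            if token not in stop_words:
                index[token].add(doc_id)
    return dict(index)


# Hypothetical sample data.
docs = {
    1: "to be or not to be",
    2: "the question is whether to index",
}
stop_words = frozenset({"be", "not", "or", "to"})

full_index = build_index(docs)
stripped_index = build_index(docs, stop_words)
```

With these sample documents, document 1 does not appear anywhere in
stripped_index, because every one of its tokens is a stop word.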

The text:

"Some implementations may apply stop word lists during indexing and be unable
to comply with query-time requests to not apply those stop words."

implies that XQuery Full Text is amenable to the approach of inverted indices
with stop words stripped at index time.

Consider the query:

declare ft-option using stop words ("be", "not", "or", "to");

"to be or not to be" contains text "to"

According to the specification:

"Stop words are tokens in the query that match any token in the text being
searched"

This seems to suggest that the result should be identical to

"to be or not to be" contains text ".+" using wildcards

Since "to be or not to be" is entirely composed of stop words, any application
of stop word lists during indexing means that the indexed text contains no
tokens, and thus the result would be "false" rather than "true".
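
The discrepancy can be sketched in Python. Both functions below are
hypothetical simplifications of a single-token "contains text" match, not the
specification's algorithm: the first follows the quoted wording (a stop word in
the query matches any token in the text), while the second models an
implementation whose index has already stripped stop words from the text.

```python
STOP_WORDS = frozenset({"be", "not", "or", "to"})


def tokenize(text):
    # Naive tokenization (assumption): lower-case, split on whitespace.
    return text.lower().split()


def matches_query_time(text, query_token, stop_words=STOP_WORDS):
    """Quoted wording: a query stop word matches any token in the text."""
    tokens = tokenize(text)
    if query_token in stop_words:
        return len(tokens) > 0
    return query_token in tokens


def matches_index_time(text, query_token, stop_words=STOP_WORDS):
    """Index-time stripping: stop words in the text never reach the index."""
    indexed = [t for t in tokenize(text) if t not in stop_words]
    if query_token in stop_words:
        # The query stop word can only match a token that survived indexing.
        return len(indexed) > 0
    return query_token in indexed


text = "to be or not to be"
query_time_result = matches_query_time(text, "to")
index_time_result = matches_index_time(text, "to")
```

Under these assumptions the two interpretations disagree on the example query:
the query-time reading finds a match, whereas the index-time reading has no
tokens left to match against.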

Received on Thursday, 17 February 2011 14:27:02 UTC