[Bug 9858] [FT] FTStopWordOption and FTCaseOption interaction clarification

http://www.w3.org/Bugs/Public/show_bug.cgi?id=9858

--- Comment #2 from Paul J. Lucas <paul@lucasmail.org>  2010-07-14 15:39:07 ---
Even though this bug has been "resolved" by making the answer "implementation
dependent," the issue, despite Mr. Dyck's statement to the contrary, really
does have to do with the query tokens. So, for the record....

From the spec, section 3.4.7:

> Stop words are tokens in the *query* that match any token in the text being searched.

> Note the asymmetry in the stop word semantics: the property of being a stop word is only relevant to query terms, not to document terms.

If my query were instead:

    let $x := <p>BEST OF TIMES</p>
    return $x contains text "best any times"
      using stop words ("any")

then the query term would effectively become:

    "best .* times" using wildcards

which matches "BEST OF TIMES" because:

> The "stop words" option specifies that if a token is within the specified collection of stop words, it is removed from the search and any token may be substituted for it.

Using .* as a replacement for each stop word satisfies the semantics of "any
token may be substituted for it."

Now, if we return to my original query: if "using case sensitive" were to apply
to stop-word determination, then "ANY" would not be found in the list of
stop-words of "any"; hence, "ANY" would not be considered a stop-word and
therefore it would not be "removed from the search and [allow] any token [to]
be substituted for it."  So "BEST ANY TIMES" would not match "BEST OF TIMES"
and the query would return false.

If "using case sensitive" were not to be considered during stop-word
determination, then "ANY" would be found in the list of stop-words of "any";
hence "ANY" would be considered a stop-word and therefore would be "removed
from the search and [allow] any token [to] be substituted for it."  So "BEST .*
TIMES" would match "BEST OF TIMES" and the query would return true.
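
To make the two readings concrete, here is a minimal Python sketch that models them. It is not XQuery and not taken from the spec. All names in it (tokenize, contains_text, stop_words_follow_case_option) are hypothetical. It assumes whitespace tokenization, assumes a stop word in the query is replaced by exactly one arbitrary document token, and assumes the original query was "BEST ANY TIMES" with stop words ("any") and "using case sensitive" against <p>BEST OF TIMES</p>, as the two paragraphs above imply.

```python
def tokenize(text):
    # Assumption: whitespace tokenization (real tokenization is
    # implementation-defined).
    return text.split()


def contains_text(document, query, stop_words,
                  case_sensitive, stop_words_follow_case_option):
    """Toy phrase match in which query stop words match any one token.

    stop_words_follow_case_option is a hypothetical switch selecting
    between the two readings discussed above: whether "using case
    sensitive" also governs stop-word determination.
    """
    doc = tokenize(document)
    qry = tokenize(query)

    def same(a, b):
        # Ordinary token comparison, governed by the case option.
        return a == b if case_sensitive else a.lower() == b.lower()

    def is_stop_word(token):
        # Only query tokens are ever tested here, never document tokens.
        if case_sensitive and stop_words_follow_case_option:
            return token in stop_words
        return token.lower() in {w.lower() for w in stop_words}

    # A stop word in the query becomes None: a slot any token may fill.
    pattern = [None if is_stop_word(t) else t for t in qry]

    # Slide the pattern over the document looking for a full match.
    for start in range(len(doc) - len(pattern) + 1):
        window = doc[start:start + len(pattern)]
        if all(p is None or same(p, d) for p, d in zip(pattern, window)):
            return True
    return False


# Reading 1: case sensitivity applies to stop-word determination.
reading_1 = contains_text("BEST OF TIMES", "BEST ANY TIMES", ["any"],
                          case_sensitive=True,
                          stop_words_follow_case_option=True)

# Reading 2: case sensitivity is ignored for stop-word determination.
reading_2 = contains_text("BEST OF TIMES", "BEST ANY TIMES", ["any"],
                          case_sensitive=True,
                          stop_words_follow_case_option=False)

print(reading_1, reading_2)
```

Under these assumptions the first reading yields false and the second yields true, which is exactly the divergence that "implementation dependent" leaves open.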

Also, and very importantly, it's intentional and entirely the point that "any"
is *not* in the text being searched.  If the query were instead:

    let $x := <p>BEST ANY TIMES</p>
    return $x contains text "BEST ANY TIMES"
      using stop words ("any")
      using case sensitive

then it would be equivalent to:

    let $x := <p>BEST ANY TIMES</p>
    return $x contains text "BEST ANY TIMES"

since the query text matches the search context tokens exactly whether "ANY" is
considered a stop-word or not.


Received on Wednesday, 14 July 2010 15:39:10 UTC