[Bug 6830] New: [FT] Thesaurus vs other Match Options


           Summary: [FT] Thesaurus vs other Match Options
           Product: XPath / XQuery / XSLT
           Version: Candidate Recommendation
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Full Text 1.0
        AssignedTo: jim.melton@acm.org
        ReportedBy: christian.gruen@gmail.com
         QAContact: public-qt-comments@w3.org

Hi again,

I noticed that combining several match options with the Thesaurus may lead
to different interpretations. My main question is whether other match options
influence the way the thesaurus works. An example:

 "improving" ftcontains "improve" with stemming

This query should return true. If we add a thesaurus here..

 "improving" ftcontains "optimizing" with stemming with thesaurus..

...and if the thesaurus resolves "optimize" to "improve", I am wondering whether
this query will return true, as the thesaurus entries would have to be stemmed
as well.

The same problem/question occurs with the default match options. E.g.: Are
diacritics to be removed in the thesaurus?

As a Thesaurus can get pretty large, similar to index structures, I would
recommend applying all match options while building and BEFORE querying the
Thesaurus; otherwise, Thesaurus requests could get pretty expensive. This is
why I would propose to extend section 3.4 of the specification:

   1. The Language Option must be applied first
   2. The Stemming Option must be applied before the Case Option and the 
      Diacritics Option
-> 3. The Thesaurus Option must be applied after all other options

This would also make sense, as the Thesaurus need not be accessed at all if the
query and document terms are equal anyway...

  "A" ftcontains "A" with thesaurus...
  -> should yield true without even checking the thesaurus
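
To make the proposed ordering concrete, here is a minimal Python sketch. It is
not taken from the specification: toy_stem, normalize, build_thesaurus and
matches are hypothetical names, and the stemmer is a deliberately crude
stand-in. The idea is only that the same normalization is applied to thesaurus
entries at build time and to the terms at query time, and that the thesaurus
is consulted last.

```python
import unicodedata

def toy_stem(term):
    # Hypothetical stand-in for a real stemmer: strips a trailing "ing" or "e".
    for suffix in ("ing", "e"):
        if term.lower().endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[:-len(suffix)]
    return term

def strip_diacritics(term):
    # Decompose characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize(term):
    # Stemming first, then case and diacritics (order taken from the list above).
    return strip_diacritics(toy_stem(term).lower())

def build_thesaurus(raw):
    # Apply the match options once, while building the thesaurus.
    return {normalize(key): {normalize(v) for v in values}
            for key, values in raw.items()}

def matches(doc_term, query_term, thesaurus):
    doc, query = normalize(doc_term), normalize(query_term)
    if doc == query:
        return True  # terms already equal: the thesaurus is never accessed
    return doc in thesaurus.get(query, set())

raw = {"optimize": ["improve"]}
thesaurus = build_thesaurus(raw)

print(matches("improving", "improve", {}))            # stemming alone
print(matches("improving", "optimizing", thesaurus))  # normalized thesaurus
print(matches("improving", "optimizing", raw))        # raw, unstemmed thesaurus
```

With the raw thesaurus the stemmed query term "optimiz" finds no entry, which
is exactly the ambiguity described above; with the normalized thesaurus the
lookup is a single dictionary access.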

I just discovered the following sentence in the first section of the spec:

"The WGs particularly solicit feedback regarding how thesauri are to be used in

So I hope that my discussion here contributes a little to this issue.



Received on Thursday, 16 April 2009 19:30:45 UTC