
[Bug 11885] New: [XQFTTS] english-stems.txt stemming dictionary

From: <bugzilla@jessica.w3.org>
Date: Thu, 27 Jan 2011 08:59:00 +0000
To: public-qt-comments@w3.org
Message-ID: <bug-11885-523@http.www.w3.org/Bugs/Public/>

           Summary: [XQFTTS] english-stems.txt stemming dictionary
           Product: XPath / XQuery / XSLT
           Version: Proposed Recommendation
          Platform: PC
        OS/Version: Windows NT
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Full Text 1.0
        AssignedTo: jim.melton@acm.org
        ReportedBy: tim@cbcl.co.uk
         QAContact: public-qt-comments@w3.org

The file "english-stems.txt" contains stemming rules only for lower case text. 
However, the specification clearly states that the "Stemming Option must be
applied before the Case Option and the Diacritics Option".

So when tokenizing the string "Dogs and Cats" with stemming, the tokens
presented to the tokenizer must be "Dogs", "and", "Cats".

The guidelines for running XQFTTS state that the "stemming-dictionary is a
plain text file containing lines of whitespace-separated tokens. Each token on
the line should stem to the first token on the line."
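
To illustrate the dictionary format quoted above, here is a minimal sketch of
a loader with a case-sensitive lookup. The function names and the two sample
lines are hypothetical and are not taken from english-stems.txt; the point is
only that a dictionary holding lowercase entries leaves "Dogs" unstemmed.

```python
def load_stems(lines):
    """Build a token -> stem map from whitespace-separated lines.

    Each token on a line maps to the first token on that line.
    """
    stems = {}
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        head = tokens[0]
        for token in tokens:
            # Keep the first mapping seen if a token appears on several lines.
            stems.setdefault(token, head)
    return stems


def stem(token, stems):
    """Case-sensitive lookup; unknown tokens are returned unchanged."""
    return stems.get(token, token)


# Hypothetical sample lines, lowercase only.
sample_lines = ["dog dogs", "cat cats"]
stems = load_stems(sample_lines)

results = [stem(t, stems) for t in "Dogs and Cats".split()]
```

With this lowercase-only sample, "dogs" would stem to "dog", but the tokens
"Dogs" and "Cats" find no entry and pass through unchanged.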

Note that it is conceivable that the stemming dictionary might stem "AIDS" to
"AIDS" but "aids" to "aid".  This would be a useful test of the order of
application of stemming and case options.  Presumably the test suite doesn't
currently test this.
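
The "AIDS"/"aids" case can be sketched as follows. This is an illustration
only: the dictionary entries are hypothetical, and the Case Option is modelled
as plain lowercasing with str.lower(), which is a simplification of what an
implementation would do.

```python
# Hypothetical case-sensitive stemming dictionary (token -> stem).
STEMS = {"AIDS": "AIDS", "aids": "aid", "aid": "aid"}


def stem(token):
    """Case-sensitive lookup; unknown tokens are returned unchanged."""
    return STEMS.get(token, token)


def stem_then_fold(token):
    """Stemming applied before the case option (the order the report cites)."""
    return stem(token).lower()


def fold_then_stem(token):
    """Case option applied first, then stemming (the wrong order)."""
    return stem(token.lower())


for token in ("AIDS", "aids"):
    print(token, stem_then_fold(token), fold_then_stem(token))
```

Under these assumptions the two orders disagree on "AIDS" (stemming first
yields "aids", case folding first yields "aid") and agree on "aids", so a
test built on such a pair would detect an implementation that folds case
before stemming.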

Received on Thursday, 27 January 2011 08:59:02 UTC
