[Bug 11885] New: [XQFTTS] english-stems.txt stemming dictionary from bugzilla@jessica.w3.org on 2011-01-27 (public-qt-comments@w3.org from January 2011)

From: <bugzilla@jessica.w3.org>
Date: Thu, 27 Jan 2011 08:59:00 +0000
To: public-qt-comments@w3.org
Message-ID: <bug-11885-523@http.www.w3.org/Bugs/Public/>

http://www.w3.org/Bugs/Public/show_bug.cgi?id=11885

           Summary: [XQFTTS] english-stems.txt stemming dictionary
           Product: XPath / XQuery / XSLT
           Version: Proposed Recommendation
          Platform: PC
        OS/Version: Windows NT
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Full Text 1.0
        AssignedTo: jim.melton@acm.org
        ReportedBy: tim@cbcl.co.uk
         QAContact: public-qt-comments@w3.org


The file "english-stems.txt" contains stemming rules only for lower case text. 
However, the specification clearly states that the "Stemming Option must be
applied before the Case Option and the Diacritics Option".

So when tokenizing the string "Dogs and Cats" with stemming, the tokens
presented to the tokenizer must be "Dogs", "and", "Cats".

The guidelines for running XQFTTS state that the "stemming-dictionary is a
plain text file containing lines of whitespace-separated tokens. Each token on
the line should stem to the first token on the line."
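
To illustrate the problem, here is a minimal Python sketch of a loader for the
dictionary format quoted above. The function names (load_stems, stem) and the
two sample lines are hypothetical and are not taken from english-stems.txt; the
lookup is assumed to be case-sensitive, since stemming runs before the Case
Option.

```python
def load_stems(text):
    """Map every token on a line to the first token on that line."""
    stems = {}
    for line in text.splitlines():
        tokens = line.split()  # whitespace-separated tokens
        if not tokens:
            continue  # skip blank lines
        head = tokens[0]
        for token in tokens:
            stems[token] = head
    return stems


def stem(token, stems):
    # Case-sensitive lookup (assumption); unknown tokens are left unchanged.
    return stems.get(token, token)


# Hypothetical lower-case-only sample, not the real dictionary contents.
SAMPLE = """\
dog dogs
cat cats
"""

stems = load_stems(SAMPLE)
result = [stem(t, stems) for t in ["Dogs", "and", "Cats"]]
```

With a dictionary that lists only lower case forms, "Dogs" and "Cats" are not
found, so no stemming takes place for the capitalised tokens.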

Note that it is conceivable that the stemming dictionary might stem "AIDS" to
"AIDS" but "aids" to "aid".  This would be a useful test of the order of
application of stemming and case options.  Presumably the test suite doesn't
currently test this.
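
A sketch of what such a test could distinguish, in Python. The dictionary
entries are the hypothetical ones from the paragraph above, and the Case Option
is modelled as plain lowercasing, which is a simplifying assumption rather than
the behaviour of any particular implementation.

```python
# Hypothetical case-sensitive stemming dictionary: token -> stem.
STEMS = {"AIDS": "AIDS", "aids": "aid", "aid": "aid"}


def stem(token):
    # Tokens not in the dictionary are returned unchanged.
    return STEMS.get(token, token)


def stem_then_lowercase(token):
    # Order required by the specification: stemming first, then case.
    return stem(token).lower()


def lowercase_then_stem(token):
    # The reverse order, which a lower-case-only dictionary would force.
    return stem(token.lower())
```

Under these assumptions the two orders give different results for "AIDS":
stem_then_lowercase yields "aids", while lowercase_then_stem yields "aid".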


Received on Thursday, 27 January 2011 08:59:02 UTC