
[Bug 6195] [FT] TestSuite - StopWords & Thesaurus

From: <bugzilla@wiggum.w3.org>
Date: Mon, 24 Nov 2008 20:51:18 +0000
To: public-qt-comments@w3.org
Message-Id: <E1L4iOs-0001Wc-Ui@farnsworth.w3.org>


Mary Holstege <holstege@mathling.com> changed:

               What|Removed                     |Added
                 CC|                            |holstege@mathling.com
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

--- Comment #1 from Mary Holstege <holstege@mathling.com>  2008-11-24 20:51:17 ---

The WG discussed this issue and agreed that we need to augment the
test suite.  Please note that we have not yet completely implemented
the use of this new system throughout the test suite. If you are satisfied
with this resolution, please mark the bug as closed.

Please note the following addition to the instructions:
Special Sources: Stop Word List, Thesaurus, and Stemming Dictionary
The stopwords, thesaurus, and stemming-dictionary sources are not intended
to be used directly in the form in which they are given, but to provide
information to those running the test suite about the expectations a
particular test has about various implementation-specific aspects of the
execution context. Implementations are expected to provide equivalent
information to the query, but in whatever form is appropriate in their
context.

A stopwords source is a plain text file containing a list of stop words,
one per line. When a query references this stop word list, the
implementation is expected to provide that list of stop words to the
query.

A thesaurus source is an XML document defined against the thesaurus.xsd
XML Schema. When a query references this thesaurus, the implementation is
expected to provide equivalent thesaurus information to the query.

A stemming-dictionary source is a plain text file containing lines of
whitespace-separated tokens. Each token on a line should stem to the
first token on that line. When the catalog entry for a query references a
stemming dictionary, the implementation is expected to provide stemming
equivalent to the rules given in the stemming dictionary.

The basic idea is that there are three new kinds of sources:
a stop word list, which is just a text file with one stop word per line;
a thesaurus, which is an XML file as per the schema; and a stemming
dictionary, which is a text file with one group of tokens per line, the
first token on each line being the stem of the others.
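
As a rough illustration of the stop word list format (plain text, one word
per line), a harness might load it along the following lines. This is only
a sketch: the function names load_stop_words and remove_stop_words, the
lower-casing, and the skipping of blank lines are assumptions made for the
example, not something the test suite prescribes.

```python
def load_stop_words(text):
    """Parse a stop word source: plain text, one stop word per line."""
    words = set()
    for line in text.splitlines():
        word = line.strip()
        if word:  # assumption: blank lines are ignored
            words.add(word.lower())  # assumption: case-insensitive matching
    return words


def remove_stop_words(tokens, stop_words):
    """Drop tokens found in the stop word set (hypothetical helper)."""
    return [t for t in tokens if t.lower() not in stop_words]


# Made-up stop word list, not taken from the test suite.
sample = "the\na\n\nof\n"
stop_words = load_stop_words(sample)
print(remove_stop_words(["The", "usability", "of", "software"], stop_words))
```

An implementation is free to hand the same information to its query
engine in any other form; the set above is just one convenient shape.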

The catalog descriptions for stop word lists and thesauri include a URI
that matches up with the one in the query.  This is similar to the
handling of schemas.  The stemming dictionary has no URI: it is the resource
ID that matters and it is used to define the relevant stem equivalents
when it makes a difference for stemmed search.

** Changes to XQFTTSCatalog.xsd/xml:

Add three new kinds of source roles: stopwords, thesaurus, and  
stemming-dictionary, and corresponding elements in the sources part of 
the catalog. Add an aux-URI element to the test-case itself.

Queries that use a URI for a stop words list should have an aux-URI with
role="stopwords"; queries that use a URI for a thesaurus should have an
aux-URI with role="thesaurus".  Queries that rely on particular stemming
behaviour should have an aux-URI with role="stemming-dictionary".
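
To make the lookup concrete, here is a minimal sketch of how a harness
could resolve the aux-URI entries of a test case against the sources in
the catalog. The element and attribute names follow this message, but the
catalog fragment itself is hypothetical: the wrapper elements, the absence
of a namespace, and the self-closing source elements are assumptions, as
are the helper names index_sources and resolve_aux.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified catalog fragment (see assumptions above).
CATALOG = """
<catalog>
  <sources>
    <stopwords ID="stopwords1"
               uri="http://bstore1.example.com/StopWordList.xml"
               FileName="stopwords.txt"/>
    <stemming-dictionary ID="english-stems" FileName="english-stems.txt"/>
  </sources>
  <test-case name="stopwords-1">
    <aux-URI role="stopwords">stopwords1</aux-URI>
  </test-case>
</catalog>
"""

ROLES = ("stopwords", "thesaurus", "stemming-dictionary")


def index_sources(root):
    """Map each source ID to a (role, element) pair."""
    index = {}
    for role in ROLES:
        for src in root.iter(role):
            index[src.get("ID")] = (role, src)
    return index


def resolve_aux(test_case, sources):
    """Return (role, FileName, uri) for every aux-URI of a test case."""
    resolved = []
    for aux in test_case.findall("aux-URI"):
        role = aux.get("role")
        src_role, src = sources[aux.text.strip()]
        if src_role != role:
            raise ValueError(
                f"aux-URI role {role!r} points at a {src_role!r} source")
        # uri is None for stemming dictionaries, which carry no URI.
        resolved.append((role, src.get("FileName"), src.get("uri")))
    return resolved


root = ET.fromstring(CATALOG)
sources = index_sources(root)
for tc in root.iter("test-case"):
    print(tc.get("name"), resolve_aux(tc, sources))
```

Keying the index on the resource ID covers all three roles uniformly,
including the stemming dictionary, which is identified by ID alone.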

** Examples:

* Stop words:

Catalog description:
     <stopwords ID="stopwords1"  
uri="http://bstore1.example.com/StopWordList.xml" FileName="stopwords.txt"
        Creator="Full-Text Task Force">
       <description last-mod="2008-11-10">Stop word list for use  

Query description using stopwords
(with stop words at "http://bstore1.example.com/StopWordList.xml"):
     <test-case is-XPath2="true" name="stopwords-1"  
scenario="standard" Creator="Full-Text Task Force">
       <description>Example using stop words</description>
       <spec-citation spec="XQueryFullText" section-number="3.4.7"  
section-title="Stop Word Option" section-pointer="ftstopwordoption"/>
       <query name="stopword-1" date="2008-11-10"/>
       <aux-URI role="stopwords">stopwords1</aux-URI>
       <input-file role="principal-data"  
       <output-file role="principal"  

* Thesaurus: (Schema is TestSources/thesaurus.xsd)

<thesaurus xmlns="http://www.w3.org/xqftts/thesarus">
       <relationship>sounds like</relationship>

Catalog description:
     <thesaurus ID="soundex"  
        Creator="Full-Text Task Force">
       <description last-mod="2008-11-10">Soundex thesaurus for  

Query using thesaurus:
(with thesaurus at "http://bstore1.example.com/UsabilitySoundex.xml"):
     <test-case is-XPath2="true" name="thesaurus-1"  
scenario="standard" Creator="Full-Text Task Force">
       <description>Example using a thesaurus</description>
       <spec-citation spec="XQueryFullText" section-number="3.4.3"  
section-title="Thesaurus Option" section-pointer="ftthesaurusoption"/>
       <query name="thesaurus-1" date="2008-11-10"/>
       <aux-URI role="thesaurus">soundex</aux-URI>
       <input-file role="principal-data"  
       <output-file role="principal"  

* Stemming: (each token stems to the first token on its line)
improve improves improving improved
dog dogs
cat cats
train trains training trained
error errors

Catalog description:
     <stemming-dictionary ID="english-stems" FileName="english-stems.txt"
        Creator="Full-Text Task Force">
       <description last-mod="2008-11-10">English stems</description>

Query using stemming:
     <test-case is-XPath2="true" name="stemming-1"  
scenario="standard" Creator="Full-Text Task Force">
       <description>Example using stemming</description>
       <spec-citation spec="XQueryFullText" section-number="3.4.4"  
section-title="Stemming Option" section-pointer="ftstemoption"/>
       <query name="stemming-1" date="2008-11-10"/>
       <aux-URI role="stemming-dictionary">english-stems</aux-URI>
       <input-file role="principal-data"  
       <output-file role="principal"  

Received on Monday, 24 November 2008 20:51:29 UTC
