Re: [Bug 6809] New: [FT] Test Suite - Thesaurus Queries

Christian,

I, too, am not a thesaurus expert, but I believe 
that I can answer a couple of your questions.

At 4/13/2009 07:43 PM, bugzilla@farnsworth.w3.org wrote:
>http://www.w3.org/Bugs/Public/show_bug.cgi?id=6809
>
>            Summary: [FT] Test Suite - Thesaurus Queries
>            Product: XPath / XQuery / XSLT
>            Version: Candidate Recommendation
>           Platform: All
>         OS/Version: All
>             Status: NEW
>           Severity: normal
>           Priority: P2
>          Component: Full Text 1.0
>         AssignedTo: jim.melton@acm.org
>         ReportedBy: christian.gruen@gmail.com
>          QAContact: public-qt-comments@w3.org
>
>
>Dear task force,
>
>I decided to add a basic Thesaurus implementation to BaseX to support and test
>the remaining queries. I frankly admit that I'm no Thesaurus expert at all, so
>I mainly focused on the hints in the specification and the existing tests. As
>I'm not sure if I completely understood what's going on in the test examples,
>here are some more questions/bug indications:
>
>
>[1] ft-3.4.3-examples-q1
>
>The usability.xml thesaurus file returns the synonym "tasks" for the query
>input "duties" - but the queried document node includes only the word in
>singular ("task" instead of "tasks"). Is this intended?

I would say "no" because I don't believe that 
thesauri are expected to *ALSO* do stemming.  But 
Pat or Mary will have more authoritative responses.
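
To make the point concrete, here is a minimal Python sketch (not taken from the spec, the test suite, or BaseX) of a thesaurus expansion that does no stemming. Only the "duties" -> "tasks" mapping comes from the report above; the function names and the sample token lists are assumptions made up for illustration.

```python
# Hypothetical thesaurus data: only "duties" -> "tasks" comes from the bug report.
SYNONYMS = {"duties": ["tasks"]}


def expand(term):
    """Return the query term plus its thesaurus synonyms, with no stemming."""
    return [term] + SYNONYMS.get(term, [])


def matches(query_term, document_tokens):
    """True if any expanded term appears verbatim among the document tokens."""
    return any(candidate in document_tokens for candidate in expand(query_term))


# Made-up token lists standing in for the queried document node.
doc_singular = ["the", "task", "was", "completed"]
doc_plural = ["the", "tasks", "were", "completed"]

print(matches("duties", doc_singular))  # False: "tasks" is not "task"
print(matches("duties", doc_plural))    # True
```

Under this reading, the singular "task" would only match if the query also asked for stemming.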



>[2] ft-3.4.3-examples-q2
>
>The thesaurus offers the terms "navigation", 
>"layout" and "terminology" for the
>query phrase "web site components", but all of the terms are not included in
>the tested document node.
>
>
>[3] ft-3.4.3-examples-q3.xq
>
>In this query, words similar to "Merrygould" are to be found. As "case
>insensitive" is the default options, the term is converted to "merrygould" in
>my tests - so the thesaurus doesn't return any result.

This is something that you've done incorrectly, 
I'm sorry to say.  If you look into the Unicode 
rules for comparing character strings, you'll 
find that "case insensitive" explicitly does NOT 
mean "put everything into lowercase (or 
uppercase) and then do the comparison".  While 
that sometimes (almost always, in fact) works for 
languages that use the simple Latin script (a/k/a 
"ASCII"), it begins to break when moving into 
Eastern European scripts.  You should at least 
consider whether you should implement the Unicode 
"case insensitive" comparison rules.

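A small Python sketch of the difference, assuming nothing beyond the standard library. The German example is my own choice rather than one from this thread. Python's `str.casefold()` applies Unicode case folding, which is the basis of Unicode default caseless matching; the helper name `caseless_equal` is hypothetical.

```python
def caseless_equal(x, y):
    """Compare two strings using Unicode case folding rather than lowercasing.

    Canonical caseless matching in Unicode also involves normalization,
    which this sketch deliberately leaves out.
    """
    return x.casefold() == y.casefold()


a, b = "Straße", "STRASSE"

# Lowercasing both sides leaves "straße" and "strasse", which differ.
print(a.lower() == b.lower())   # False

# Case folding maps "ß" to "ss", so the two strings compare equal.
print(caseless_equal(a, b))     # True

# For plain ASCII input such as the test's term, both approaches agree.
print(caseless_equal("Merrygould", "merrygould"))  # True
```
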
Aside from that, I would expect that thesaurus 
searches should be done with case-insensitive 
comparisons, in which case the thesaurus search 
would properly find "Merrygould".  Pat and Mary 
will be more authoritative than I, however.
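
One way to get that behavior, sketched in Python under my own assumptions: index the thesaurus by case-folded terms and fold the query term before the lookup, so the stored spelling is preserved. The `Thesaurus` class is hypothetical, and the single entry (including its direction) is made up for illustration; it is not the content of the test suite's thesaurus file.

```python
class Thesaurus:
    """Hypothetical in-memory thesaurus with caseless term lookup."""

    def __init__(self, entries):
        # Key the index by the case-folded term; keep related terms as written.
        self._index = {
            term.casefold(): list(related) for term, related in entries.items()
        }

    def lookup(self, term):
        """Return the related terms for `term`, ignoring case differences."""
        return self._index.get(term.casefold(), [])


# Made-up entry: the real thesaurus file may relate these terms differently.
thesaurus = Thesaurus({"Merrygould": ["Marigold"]})

print(thesaurus.lookup("merrygould"))  # ['Marigold']
print(thesaurus.lookup("MERRYGOULD"))  # ['Marigold']
print(thesaurus.lookup("unknown"))     # []
```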



>[4] Probably a naïve question: do all thesaurus entries work in a
>"bidirectional" way? I.e., if "A" is a synonym for "B", do I get "A" if I look
>for "B", and "B" if I look for "A"? Next to that, are all synonym
>bidirectional? One could argue that "Marigold" sounds like "Merrygould", but
>"Merrygould" doesn't sound like "Marigold". In 
>the latter case, the upper query
>[3] would only return results in the direction opposite to the current one.
>
>
>[5] ft-3.4.3-expressions-q3
>
>The thesaurus returns "software" for the term "program"; this term seems to be
>included in two books (number 1 and 3), but the current result contains only
>book 1.
>
>
>[6] ft-3.4.3-expressions-q5
>
>..references the missing file "TechnicalThesaurus.xml".

The test suite catalog has this element:

     <thesaurus ID="technical"
                uri="http://bstore1.example.com/TechnicalThesaurus.xml"
                FileName="TestSources/intentionally-missing.xml"
                Creator="Full-Text Task Force">
       <description last-mod="2009-01-09">(Missing) technical thesaurus</description>
     </thesaurus>

From the FileName and from the description, I 
believe it's evident that the file is INTENDED to 
be missing.  However, looking at the catalog entry for the test:

     <test-case is-XPath2="false"
                name="ft-3.4.3-expressions-q5"
                FilePath="Expressions/Operators/CompExpr/FTContainsExpr/FTSelection/MatchOptions/FTThesaurus/"
                scenario="standard"
                Creator="Full-Text Task Force">
       <description>WIth thesaurus level query. Find infrastructure at the 2nd level of narrower terms in a thesaurus.</description>
       <spec-citation spec="XQueryFullText"
                      section-number="3.4.3"
                      section-title="Thesaurus Option"
                      section-pointer="ftthesaurusoption"/>
       <query name="ftthesaurus-q5" date="2008-11-28"/>
       <aux-URI role="thesaurus">usability</aux-URI>
       <input-file role="principal-data" variable="input-context">ftusecases</input-file>
       <output-file role="principal" compare="Fragment">ftthesaurus-results-q5.txt</output-file>
     </test-case>

I see that the test lists an expected result file 
rather than an expected exception/error.  The 
spec says, in section 3.4.3, this:

    If the URI specifies a thesaurus that is not 
found in the statically known thesauri, an error 
is raised [err:FTST0018].


So, I'm confused, too.
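
For reference, the behavior the quoted spec text describes might look like the following Python sketch. The error code FTST0018 and the technical thesaurus URI come from the text above; the `FullTextError` class, the `resolve_thesaurus` function, and the single registered URI are hypothetical names of my own, not part of any real implementation.

```python
class FullTextError(Exception):
    """Hypothetical exception type carrying a spec error code."""

    def __init__(self, code, message):
        super().__init__(f"err:{code}: {message}")
        self.code = code


# Hypothetical registry of statically known thesauri (URI -> loaded thesaurus).
# The URI below is a placeholder, not one taken from the test suite.
STATICALLY_KNOWN_THESAURI = {
    "http://example.com/hypothetical-usability-thesaurus.xml": {"duties": ["tasks"]},
}


def resolve_thesaurus(uri, known=STATICALLY_KNOWN_THESAURI):
    """Return the thesaurus registered for `uri`, or raise err:FTST0018."""
    try:
        return known[uri]
    except KeyError:
        raise FullTextError(
            "FTST0018", f"thesaurus not statically known: {uri}"
        ) from None


try:
    resolve_thesaurus("http://bstore1.example.com/TechnicalThesaurus.xml")
except FullTextError as err:
    print(err.code)  # FTST0018
```

A test written against this reading would expect the error rather than a result fragment.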




>[7] ft-3.4.3-expressions-q6
>
>parentheses missing before "default" and after 
>"NT". I guess that the Thesaurus
>should also accept the original query terms and not only synonyms; is this
>correct? If "yes", then book number 3 should be 
>added as result, as it contains
>the term "Computers".
>
>
>[8] thesaurus-queries-results-q2 / q2b
>
>As the used relationship is "narrower terms" 
>here (instead of "NT" or "narrower
>term") - do you expect implementations to 
>recognize all kinds of writings, or ?

I believe this is a typographical error in the 
queries themselves.  The relationships are 
prescriptive, complete, and definitive, so we do 
not expect variations to be acceptable.  
HOWEVER, the spec states in section 3.4.3 that:

    If a query specifies thesaurus relationships 
or levels not supported by the thesaurus, or does 
not specify a relationship, the behavior is 
implementation-defined.


Therefore, the use of "narrower terms" is 
technically allowed, but that depends on the 
implementation.  I would expect that a strict 
implementation would raise an 
implementation-defined error, but a more relaxed 
implementation could choose to ignore the relationship semantically.
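
The two choices could be sketched in Python roughly as follows. Everything here is an assumption for illustration: the `UnsupportedRelationship` error (the spec names no code for this case), the `strict` flag, the sample entries, and in particular my reading of "ignore the relationship" as "return related terms under any relationship".

```python
class UnsupportedRelationship(Exception):
    """Hypothetical implementation-defined error for an unknown relationship."""


# Made-up thesaurus content: term -> {relationship: [related terms]}.
ENTRIES = {"computers": {"NT": ["laptops"], "RT": ["software"]}}


def related_terms(term, relationship, strict=True):
    """Look up terms related to `term` under `relationship`.

    strict=True  raises when the thesaurus does not know the relationship.
    strict=False ignores the relationship and returns every related term.
    """
    by_relationship = ENTRIES.get(term, {})
    supported = {rel for rels in ENTRIES.values() for rel in rels}
    if relationship in supported:
        return list(by_relationship.get(relationship, []))
    if strict:
        raise UnsupportedRelationship(relationship)
    # Relaxed mode: flatten the terms from all relationships, in stored order.
    return [t for terms in by_relationship.values() for t in terms]


print(related_terms("computers", "NT"))                            # ['laptops']
print(related_terms("computers", "narrower terms", strict=False))  # ['laptops', 'software']
```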




>[9] thesaurus-queries-results-q5 / q5b / q6 / q6b
>
>"spellcheck.xml" and "OurTaxonomy.xml" don't exist yet.
>
>
>[10] full-text-composability-queries-results-q2b
>
>Parsing issue: "]" missing after "stemming"
>
>
>[11] full-text-composability-queries-results-q3 / q3b
>
>Parsing issue: some opening and closing parentheses are missing.
>
>
>
>I'm currently running the Thesaurus as the last 
>match option, as I saw that the
>execution order of match options seems to be implementation defined. It may
>well be that different orders could result in 
>different results - but I haven't
>really thought this through.

Michael Dyck should have some thoughts here, as I 
know he did some serious thinking about the 
sequence of application of match options.

Hope this helps,
    Jim


>Concluding, as I indicated in the beginning, my knowledge on Thesauri is very
>limited. So maybe it will be helpful to directly talk to one of you in near
>future to get more insight in some of the open issues..
>
>Christian
>
>
>--
>Configure bugmail: http://www.w3.org/Bugs/Public/userprefs.cgi?tab=email
>------- You are receiving this mail because: -------
>You are the QA contact for the bug.

========================================================================
Jim Melton --- Editor of ISO/IEC 9075-* (SQL)     Phone: +1.801.942.0144
   Chair, W3C XML Query WG; XQX (etc.) editor       Fax : +1.801.942.3345
Oracle Corporation        Oracle Email: jim dot melton at oracle dot com
1930 Viscounti Drive      Standards email: jim dot melton at acm dot org
Sandy, UT 84093-1063 USA          Personal email: jim at melton dot name
========================================================================
=  Facts are facts.   But any opinions expressed are the opinions      =
=  only of myself and may or may not reflect the opinions of anybody   =
=  else with whom I may or may not have discussed the issues at hand.  =
========================================================================  

Received on Tuesday, 14 April 2009 18:18:42 UTC