- From: Jim Melton <jim.melton@oracle.com>
- Date: Tue, 14 Apr 2009 12:17:34 -0600
- To: bugzilla@farnsworth.w3.org
- Cc: public-qt-comments@w3.org
- Message-Id: <7.0.1.0.2.20090414115549.0036f3a0@oracle.com>
Christian,
I, too, am not a thesaurus expert, but I believe
that I can answer a couple of your questions.
At 4/13/2009 07:43 PM, bugzilla@farnsworth.w3.org wrote:
>http://www.w3.org/Bugs/Public/show_bug.cgi?id=6809
>
> Summary: [FT] Test Suite - Thesaurus Queries
> Product: XPath / XQuery / XSLT
> Version: Candidate Recommendation
> Platform: All
> OS/Version: All
> Status: NEW
> Severity: normal
> Priority: P2
> Component: Full Text 1.0
> AssignedTo: jim.melton@acm.org
> ReportedBy: christian.gruen@gmail.com
> QAContact: public-qt-comments@w3.org
>
>
>Dear task force,
>
>I decided to add a basic Thesaurus implementation to BaseX to support and test
>the remaining queries. I frankly admit that I'm no Thesaurus expert at all, so
>I mainly focused on the hints in the specification and the existing tests. As
>I'm not sure if I completely understood what's going on in the test examples,
>here are some more questions/bug indications:
>
>
>[1] ft-3.4.3-examples-q1
>
>The usability.xml thesaurus file returns the synonym "tasks" for the query
>input "duties" - but the queried document node includes only the word in
>singular ("task" instead of "tasks"). Is this intended?
I would say "no" because I don't believe that
thesauri are expected to *ALSO* do stemming. But
Pat or Mary will have more authoritative responses.
>[2] ft-3.4.3-examples-q2
>
>The thesaurus offers the terms "navigation", "layout" and "terminology" for
>the query phrase "web site components", but all of the terms are not included
>in the tested document node.
>
>
>[3] ft-3.4.3-examples-q3.xq
>
>In this query, words similar to "Merrygould" are to be found. As "case
>insensitive" is the default option, the term is converted to "merrygould" in
>my tests - so the thesaurus doesn't return any result.
This is something that you've done incorrectly,
I'm sorry to say. If you look into the Unicode
rules for comparing character strings, you'll
find that "case insensitive" explicitly does NOT
mean "put everything into lowercase (or
uppercase) and then do the comparison". While
that sometimes (almost always, in fact) works for
languages that use the simple Latin script (a/k/a
"ASCII"), it begins to break when moving into
Eastern European scripts. You should at least
consider whether you should implement the Unicode
"case insensitive" comparison rules.
Aside from that, I would expect that thesaurus
searches should be done with case-insensitive
comparisons, in which case the thesaurus search
would properly find "Merrygould". Pat and Mary
will be more authoritative than I, however.
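To illustrate the difference between lowercasing and a Unicode caseless comparison, here is a minimal sketch in Python. The helper name `caseless_equal` is hypothetical, and the German sharp s is only one convenient example of where simple lowercasing falls short; the NFD-fold-NFD sequence is meant to approximate the canonical caseless match from the Unicode Standard, not to reproduce any particular implementation.

```python
import unicodedata


def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings with Unicode full case folding.

    Normalizing to NFD before and after folding approximates the
    canonical caseless match defined by the Unicode Standard.
    """
    def fold(s: str) -> str:
        decomposed = unicodedata.normalize("NFD", s)
        return unicodedata.normalize("NFD", decomposed.casefold())

    return fold(a) == fold(b)


# Lowercasing both sides does not equate the sharp s with "SS" ...
print("Straße".lower() == "STRASSE".lower())       # False
# ... whereas full case folding does.
print(caseless_equal("Straße", "STRASSE"))         # True
# The thesaurus lookup term from query [3] matches either way.
print(caseless_equal("Merrygould", "merrygould"))  # True
```

With a comparison of this kind, the thesaurus lookup does not need to rewrite the query term to lowercase before searching.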
>[4] Probably a naïve question: do all thesaurus entries work in a
>"bidirectional" way? I.e., if "A" is a synonym for "B", do I get "A" if I look
>for "B", and "B" if I look for "A"? Next to that, are all synonyms
>bidirectional? One could argue that "Marigold" sounds like "Merrygould", but
>"Merrygould" doesn't sound like "Marigold". In the latter case, the upper
>query [3] would only return results in the direction opposite to the current
>one.
>
>
>[5] ft-3.4.3-expressions-q3
>
>The thesaurus returns "software" for the term "program"; this term seems to be
>included in two books (number 1 and 3), but the current result contains only
>book 1.
>
>
>[6] ft-3.4.3-expressions-q5
>
>..references the missing file "TechnicalThesaurus.xml".
The test suite catalog has this element:
  <thesaurus ID="technical"
             uri="http://bstore1.example.com/TechnicalThesaurus.xml"
             FileName="TestSources/intentionally-missing.xml"
             Creator="Full-Text Task Force">
    <description last-mod="2009-01-09">(Missing) technical thesaurus</description>
  </thesaurus>
From the FileName and from the description, I
believe it's evident that the file is INTENDED to
be missing. However, looking at the catalog entry for the test:
  <test-case is-XPath2="false"
             name="ft-3.4.3-expressions-q5"
             FilePath="Expressions/Operators/CompExpr/FTContainsExpr/FTSelection/MatchOptions/FTThesaurus/"
             scenario="standard" Creator="Full-Text Task Force">
    <description>WIth thesaurus level query. Find infrastructure at the 2nd level of narrower terms in a thesaurus.</description>
    <spec-citation spec="XQueryFullText" section-number="3.4.3"
                   section-title="Thesaurus Option" section-pointer="ftthesaurusoption"/>
    <query name="ftthesaurus-q5" date="2008-11-28"/>
    <aux-URI role="thesaurus">usability</aux-URI>
    <input-file role="principal-data" variable="input-context">ftusecases</input-file>
    <output-file role="principal" compare="Fragment">ftthesaurus-results-q5.txt</output-file>
  </test-case>
I see that there is an expected result file, but
not a result that is an exception/error. The
spec says, in section 3.4.3, this:
   If the URI specifies a thesaurus that is not found in the
   statically known thesauri, an error is raised [err:FTST0018].
So, I'm confused, too.
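For what it's worth, the behavior the spec text describes could be sketched roughly as follows. This is only an illustration in Python: the exception class, the `resolve_thesaurus` function, the "known" URI and the file path are all hypothetical; only the technical thesaurus URI comes from the catalog entry quoted above.

```python
from typing import Mapping


class FTST0018(Exception):
    """Stand-in for err:FTST0018 (thesaurus not statically known)."""


def resolve_thesaurus(uri: str, statically_known: Mapping[str, str]) -> str:
    """Return the location registered for `uri`, or raise FTST0018."""
    try:
        return statically_known[uri]
    except KeyError:
        raise FTST0018(f"thesaurus not statically known: {uri}") from None


# Hypothetical registry of statically known thesauri (URI -> local file).
KNOWN_URI = "http://example.com/known-thesaurus.xml"
known = {KNOWN_URI: "thesauri/known.xml"}

print(resolve_thesaurus(KNOWN_URI, known))

try:
    # URI taken from the catalog entry; it is deliberately not registered.
    resolve_thesaurus("http://bstore1.example.com/TechnicalThesaurus.xml", known)
except FTST0018 as err:
    print("error raised:", err)
```

Under this reading, a test that references the intentionally missing thesaurus would be expected to end in the error rather than in a result file.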
>[7] ft-3.4.3-expressions-q6
>
>parentheses missing before "default" and after "NT". I guess that the
>Thesaurus should also accept the original query terms and not only synonyms;
>is this correct? If "yes", then book number 3 should be added as result, as it
>contains the term "Computers".
>
>
>[8] thesaurus-queries-results-q2 / q2b
>
>As the used relationship is "narrower terms" here (instead of "NT" or
>"narrower term") - do you expect implementations to recognize all kinds of
>writings, or ?
I believe this is a typographical error in the
queries themselves. The relationships are
prescriptive, complete, and definitive, so we do
not expect variations to be
acceptable. HOWEVER, the spec states in section 3.4.3 that:
   If a query specifies thesaurus relationships or levels not
   supported by the thesaurus, or does not specify a relationship,
   the behavior is implementation-defined.
Therefore, the use of "narrower terms" is
technically allowed, but that depends on the
implementation. I would expect that a strict
implementation would raise an
implementation-defined error, but a more relaxed
implementation could choose to ignore the relationship semantically.
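The two choices could look roughly like this in an implementation. This is a hedged sketch in Python, not anything the spec prescribes: the function name, the `strict` flag, the use of `ValueError` and the set of supported relationships (just the two spellings mentioned in the question above) are assumptions made for illustration.

```python
from typing import Optional

# Assumption: the relationships this hypothetical thesaurus supports.
SUPPORTED_RELATIONSHIPS = {"NT", "narrower term"}


def check_relationship(relationship: str, strict: bool = True) -> Optional[str]:
    """Return the relationship if supported.

    Strict mode raises an error for an unsupported relationship;
    relaxed mode returns None, meaning "ignore the relationship".
    """
    if relationship in SUPPORTED_RELATIONSHIPS:
        return relationship
    if strict:
        raise ValueError(f"unsupported thesaurus relationship: {relationship!r}")
    return None


print(check_relationship("NT"))                             # NT
print(check_relationship("narrower terms", strict=False))   # None

try:
    check_relationship("narrower terms")
except ValueError as err:
    print("strict mode:", err)
```

Either behavior is consistent with "implementation-defined"; the point is only that the implementation has to pick one and document it.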
>[9] thesaurus-queries-results-q5 / q5b / q6 / q6b
>
>"spellcheck.xml" and "OurTaxonomy.xml" don't exist yet.
>
>
>[10] full-text-composability-queries-results-q2b
>
>Parsing issue: "]" missing after "stemming"
>
>
>[11] full-text-composability-queries-results-q3 / q3b
>
>Parsing issue: some opening and closing parentheses are missing.
>
>
>
>I'm currently running the Thesaurus as the last match option, as I saw that
>the execution order of match options seems to be implementation defined. It
>may well be that different orders could result in different results - but I
>haven't really thought this through.
Michael Dyck should have some thoughts here, as I
know he did some serious thinking about the
sequence of application of match options.
Hope this helps,
Jim
>Concluding, as I indicated in the beginning, my knowledge on Thesauri is very
>limited. So maybe it will be helpful to directly talk to one of you in near
>future to get more insight in some of the open issues..
>
>Christian
>
>
>--
>Configure bugmail: http://www.w3.org/Bugs/Public/userprefs.cgi?tab=email
>------- You are receiving this mail because: -------
>You are the QA contact for the bug.
========================================================================
Jim Melton --- Editor of ISO/IEC 9075-* (SQL) Phone: +1.801.942.0144
Chair, W3C XML Query WG; XQX (etc.) editor Fax : +1.801.942.3345
Oracle Corporation Oracle Email: jim dot melton at oracle dot com
1930 Viscounti Drive Standards email: jim dot melton at acm dot org
Sandy, UT 84093-1063 USA Personal email: jim at melton dot name
========================================================================
= Facts are facts. But any opinions expressed are the opinions =
= only of myself and may or may not reflect the opinions of anybody =
= else with whom I may or may not have discussed the issues at hand. =
========================================================================
Received on Tuesday, 14 April 2009 18:18:42 UTC