- From: Jim Melton <jim.melton@oracle.com>
- Date: Tue, 14 Apr 2009 12:17:34 -0600
- To: bugzilla@farnsworth.w3.org
- Cc: public-qt-comments@w3.org
- Message-Id: <7.0.1.0.2.20090414115549.0036f3a0@oracle.com>
Christian,

I, too, am not a thesaurus expert, but I believe that I can answer a couple of your questions.

At 4/13/2009 07:43 PM, bugzilla@farnsworth.w3.org wrote:
>http://www.w3.org/Bugs/Public/show_bug.cgi?id=6809
>
>           Summary: [FT] Test Suite - Thesaurus Queries
>           Product: XPath / XQuery / XSLT
>           Version: Candidate Recommendation
>          Platform: All
>        OS/Version: All
>            Status: NEW
>          Severity: normal
>          Priority: P2
>         Component: Full Text 1.0
>        AssignedTo: jim.melton@acm.org
>        ReportedBy: christian.gruen@gmail.com
>         QAContact: public-qt-comments@w3.org
>
>
>Dear task force,
>
>I decided to add a basic Thesaurus implementation to BaseX to support and test
>the remaining queries. I frankly admit that I'm no Thesaurus expert at all, so
>I mainly focused on the hints in the specification and the existing tests. As
>I'm not sure if I completely understood what's going on in the test examples,
>here are some more questions/bug indications:
>
>
>[1] ft-3.4.3-examples-q1
>
>The usability.xml thesaurus file returns the synonym "tasks" for the query
>input "duties" - but the queried document node includes only the word in
>singular ("task" instead of "tasks"). Is this intended?

I would say "no" because I don't believe that thesauri are expected to *ALSO* do stemming. But Pat or Mary will have more authoritative responses.

>[2] ft-3.4.3-examples-q2
>
>The thesaurus offers the terms "navigation", "layout" and "terminology" for the
>query phrase "web site components", but all of the terms are not included in
>the tested document node.
>
>
>[3] ft-3.4.3-examples-q3.xq
>
>In this query, words similar to "Merrygould" are to be found. As "case
>insensitive" is the default options, the term is converted to "merrygould" in
>my tests - so the thesaurus doesn't return any result.

This is something that you've done incorrectly, I'm sorry to say.
If you look into the Unicode rules for comparing character strings, you'll find that "case insensitive" explicitly does NOT mean "put everything into lowercase (or uppercase) and then do the comparison". While that sometimes (almost always, in fact) works for languages that use the simple Latin script (a/k/a "ASCII"), it begins to break when moving into Eastern European scripts. You should at least consider whether you should implement the Unicode "case insensitive" comparison rules.

Aside from that, I would expect that thesauri searches should be done with case-insensitive comparisons, in which case the thesaurus search would properly find "Merrygould". Pat and Mary will be more authoritative than I, however.

>[4] Probably a naïve question: do all thesaurus entries work in a
>"bidirectional" way? I.e., if "A" is a synonym for "B", do I get "A" if I look
>for "B", and "B" if I look for "A"? Next to that, are all synonym
>bidirectional? One could argue that "Marigold" sounds like "Merrygould", but
>"Merrygould" doesn't sound like "Marigold". In the latter case, the upper query
>[3] would only return results in the direction opposite to the current one.
>
>
>[5] ft-3.4.3-expressions-q3
>
>The thesaurus returns "software" for the term "program"; this term seems to be
>included in two books (number 1 and 3), but the current result contains only
>book 1.
>
>
>[6] ft-3.4.3-expressions-q5
>
>..references the missing file "TechnicalThesaurus.xml".

The test suite catalog has this element:

<thesaurus ID="technical"
           uri="http://bstore1.example.com/TechnicalThesaurus.xml"
           FileName="TestSources/intentionally-missing.xml"
           Creator="Full-Text Task Force">
   <description last-mod="2009-01-09">(Missing) technical thesaurus</description>
</thesaurus>

From the FileName and from the description, I believe it's evident that the file is INTENDED to be missing.
However, looking at the catalog entry for the test:

<test-case is-XPath2="false" name="ft-3.4.3-expressions-q5"
           FilePath="Expressions/Operators/CompExpr/FTContainsExpr/FTSelection/MatchOptions/FTThesaurus/"
           scenario="standard" Creator="Full-Text Task Force">
   <description>WIth thesaurus level query. Find infrastructure at the 2nd level of narrower terms in a thesaurus.</description>
   <spec-citation spec="XQueryFullText" section-number="3.4.3"
                  section-title="Thesaurus Option" section-pointer="ftthesaurusoption"/>
   <query name="ftthesaurus-q5" date="2008-11-28"/>
   <aux-URI role="thesaurus">usability</aux-URI>
   <input-file role="principal-data" variable="input-context">ftusecases</input-file>
   <output-file role="principal" compare="Fragment">ftthesaurus-results-q5.txt</output-file>
</test-case>

I see that there is an expected result file, but not a result that is an exception/error. The spec says, in section 3.4.3, this:

   If the URI specifies a thesaurus that is not found in the statically
   known thesauris, an error is raised [err:FTST0018].

So, I'm confused, too.

>[7] ft-3.4.3-expressions-q6
>
>parentheses missing before "default" and after "NT". I guess that the Thesaurus
>should also accept the original query terms and not only synonyms; is this
>correct? If "yes", then book number 3 should be added as result, as it contains
>the term "Computers".
>
>
>[8] thesaurus-queries-results-q2 / q2b
>
>As the used relationship is "narrower terms" here (instead of "NT" or "narrower
>term") - do you expect implementations to recognize all kinds of writings, or ?

I believe this is a typographical error in the queries themselves. The relationships are prescriptive, complete, and definitive, so we do not expect for variations to be acceptable.
HOWEVER, the spec states in section 3.4.3 that:

   If a query specifies thesaurus relationships or levels not supported by
   the thesaurus, or does not specify a relationship, the behavior is
   implementation-defined.

Therefore, the use of "narrower terms" is technically allowed, but that depends on the implementation. I would expect that a strict implementation would raise an implementation-defined error, but a more relaxed implementation could choose to ignore the relationship semantically.

>[9] thesaurus-queries-results-q5 / q5b / q6 / q6b
>
>"spellcheck.xml" and "OurTaxonomy.xml" don't exist yet.
>
>
>[10] full-text-composability-queries-results-q2b
>
>Parsing issue: "]" missing after "stemming"
>
>
>[11] full-text-composability-queries-results-q3 / q3b
>
>Parsing issue: some opening and closing parentheses are missing.
>
>
>
>I'm currently running the Thesaurus as the last match option, as I saw that the
>execution order of match options seems to be implementation defined. It may
>well be that different orders could result in different results - but I haven't
>really thought this through.

Michael Dyck should have some thoughts here, as I know he did some serious thinking about the sequence of application of match options.

Hope this helps,
    Jim

>Concluding, as I indicated in the beginning, my knowledge on Thesauri is very
>limited. So maybe it will be helpful to directly talk to one of you in near
>future to get more insight in some of the open issues..
>
>Christian
>
>
>--
>Configure bugmail: http://www.w3.org/Bugs/Public/userprefs.cgi?tab=email
>------- You are receiving this mail because: -------
>You are the QA contact for the bug.

========================================================================
Jim Melton --- Editor of ISO/IEC 9075-* (SQL)     Phone: +1.801.942.0144
  Chair, W3C XML Query WG; XQX (etc.) editor      Fax : +1.801.942.3345
Oracle Corporation        Oracle Email: jim dot melton at oracle dot com
1930 Viscounti Drive      Standards email: jim dot melton at acm dot org
Sandy, UT 84093-1063 USA  Personal email: jim at melton dot name
========================================================================
=  Facts are facts.   But any opinions expressed are the opinions      =
=  only of myself and may or may not reflect the opinions of anybody   =
=  else with whom I may or may not have discussed the issues at hand.  =
========================================================================
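
Illustration for the case-insensitivity point under [3]: the message does not name a specific algorithm, so the following is only a minimal sketch, assuming that Unicode full case folding (Python's built-in `str.casefold()`) after NFKC normalization is an acceptable stand-in for the Unicode "case insensitive" comparison rules. The `fold` and `lookup` helpers and the one-entry `THESAURUS` table are hypothetical and do not come from BaseX or from the test suite. The German "ß" is used as the counterexample here only because it is a well-known case where plain lowercasing and case folding disagree.

```python
import unicodedata


def fold(term):
    """Build a caseless comparison key: NFKC-normalize, then case-fold."""
    return unicodedata.normalize("NFKC", term).casefold()


# Hypothetical one-entry thesaurus, keyed by the folded form of each term.
THESAURUS = {fold("Merrygould"): ["Marigold"]}


def lookup(term):
    """Return the thesaurus entries for a term, compared caselessly."""
    return THESAURUS.get(fold(term), [])


# The query term matches no matter how it was cased on either side.
print(lookup("merrygould"))
print(lookup("MERRYGOULD"))

# Lowercasing both sides is not equivalent to case folding.
print("STRASSE".lower() == "Straße".lower())  # False
print(fold("STRASSE") == fold("Straße"))      # True
```

Folding both the thesaurus keys and the query term with the same function keeps the comparison symmetric, so it does not matter which side was lowercased by an earlier processing step.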
Received on Tuesday, 14 April 2009 18:18:42 UTC