- From: Pat Case <PCASE@crs.loc.gov>
- Date: Mon, 24 Mar 2003 14:41:39 -0500
- To: <kai.grossjohann@uni-duisburg.de>
- Cc: <public-qt-comments@w3.org>
Kai,

From where I sit, the first thing we need is full-text querying in XQuery encompassing functionalities which currently exist.

As a librarian and expert searcher, I find even stemming algorithms fail me often enough that I want to retain the crude but totally controllable, predictable wildcards. I build better queries with wildcards than I can with stemming, because stemming doesn't allow me to decide which related words to include on a word-by-word basis. Just because a word is linguistically related doesn't mean it returns the results I want. Stemming is a black box which works against expert searchers as often as it works for them. We feel the same about scoring and ranking.

Different users benefit from different tools. I wouldn't expect a novice user to use wildcards or to be so annoyed with scoring and ranking.

Which doesn't mean I wouldn't welcome the likes of a linguistic parser. It would be a boon to all end users.

Are you recommending we add a use case which calls an implementation-defined linguistic parser (as we did for stemming) or are you recommending more than that?

Pat

>>> Kai Großjohann <kai.grossjohann@uni-duisburg.de> 03/24/03 01:14PM >>>
"Pat Case" <PCASE@crs.loc.gov> writes:

> [Pat Case: Please remember that a phrase query is a proximity query
> (ordered, allowing no intervening words).

IMHO this is an unnecessary limitation. In the following I will say "linguistic phrase search" when talking about the operation I mean, to avoid confusion with the FTS document.

> Also remember we are defining the functionalities which will be
> available for implementors. We don't expect most end users to define the
> parameters for a proximity query, but we do expect them to profit from
> proximity querying.

Proximity querying is a good thing, but linguistic phrase search is even better :-) -- at least for some common use cases. (Just like wildcards are a good thing, but often, stemming is even better.)

> I do not emphasize phrase querying because I think it is as dangerous
> as "or" querying is useless. I advise end users to do wider unordered
> proximity queries instead. In a system which supports phrase query I
> would build a More button that runs a wider unordered proximity query to
> pick up the missed results. My favorite example is in the internal
> system I work on for congressional documents. Folks search on
> "elementary education" and find very little. It is a reasonable query
> but it fails because congressional bills almost exclusively carry the
> phrase "elementary and secondary education". Allow a few intervening
> characters and hundreds of bills are returned.]

If you have a linguistic parser that recognizes that "elementary and secondary education" is the same as "elementary education and secondary education", then you don't need kludges such as proximity search to correctly find the linguistic phrase "elementary education".

So by focussing on the ordered/unordered proximity search with various constraints, systems containing a linguistic parser are left standing in the rain.

My suggestion is to imagine what the user wants to do, semantically, and then define search predicates that capture this idea. Then it is left to the implementor (of the XQuery processing engine) to decide how to do it.

Of course, it's possible to recognize that linguistic parsers are not common in text retrieval systems yet, and to include some other search predicates to accommodate this fact. But why let the standards definition be influenced by what current systems do?

The FTS definition has successfully made the step from the syntactic level (wildcards) to the semantic level (stemming) in one case -- normalization of derived forms.

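To make the quoted "elementary education" example concrete, here is a minimal Python sketch contrasting a strict phrase query (ordered, no intervening words) with a wider unordered proximity query. The function names, the whitespace-and-punctuation tokenizer, and the window size are assumptions made for illustration; none of this is syntax or behaviour taken from the FTS document.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_match(text, phrase):
    """Strict phrase query: terms in order, no intervening words."""
    tokens, terms = tokenize(text), tokenize(phrase)
    n = len(terms)
    return any(tokens[i:i + n] == terms for i in range(len(tokens) - n + 1))


def unordered_proximity_match(text, terms, window):
    """Unordered proximity query: every term occurs, in any order,
    somewhere inside a run of `window` consecutive words."""
    tokens = tokenize(text)
    wanted = set(tokenize(" ".join(terms)))
    if not wanted:
        return False
    return any(wanted <= set(tokens[i:i + window]) for i in range(len(tokens)))


# Hypothetical bill title, modelled on the phrase quoted above.
title = "A bill to improve elementary and secondary education"

print(phrase_match(title, "elementary education"))
print(unordered_proximity_match(title, ["elementary", "education"], 4))
```

On this title the strict phrase query misses, while a four-word unordered window finds both terms. A linguistic phrase search would instead have to recognize the coordination itself, which this sketch does not attempt.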
I am suggesting making this very same step in other cases as well: searching for (linguistic) noun phrases (perhaps other kinds of phrases), searching for similar-sounding words (Soundex is only one algorithm for this), accommodating OCR errors or typos (the Damerau-Levenshtein metric is one approach), searching for similar dates, searching for similar-looking colors, and so on.

Note that the FTS document does not say how to implement stemming. So I don't expect that a similar lack of specification of how linguistic phrase search will be implemented is a problem.

I think that I'm having problems expressing the ideas in my mind. I hope you can understand them even though the expression is clumsy.

-- 
A preposition is not a good thing to end a sentence with.
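As an illustration of the typo-tolerant matching mentioned above, the following Python sketch computes the optimal string alignment distance, a restricted variant of the Damerau-Levenshtein metric in which no substring is edited more than once. It is one possible way an implementation might score near-matches, not something the FTS document prescribes; the function name is hypothetical.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein).

    Counts insertions, deletions, substitutions, and transpositions of
    two adjacent characters, each at a cost of 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i          # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j          # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Adjacent transposition, e.g. "ca" <-> "ac".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


# A swapped pair of letters counts as a single edit.
print(osa_distance("education", "eduaction"))
```

A search predicate built on such a metric would accept any indexed word within some small distance of the query term; the threshold is a tuning choice left to the implementor.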
Received on Monday, 24 March 2003 14:42:06 UTC