Re: FTS comments from Pat Case on 2003-03-24 (public-qt-comments@w3.org from March 2003)

From: Pat Case <PCASE@crs.loc.gov>
Date: Mon, 24 Mar 2003 14:41:39 -0500
To: <kai.grossjohann@uni-duisburg.de>
Cc: <public-qt-comments@w3.org>
Message-Id: <se7f1931.062@crs.loc.gov>
Kai, 

From where I sit, the first thing we need is full-text querying in
XQuery encompassing functionalities which currently exist.

As a librarian and expert searcher, I find even stemming algorithms
fail me often enough that I want to retain the crude, but totally
controllable, predictable wildcards. I build better queries with
wildcards than I can with stemming, because stemming doesn't allow me to
decide which related words to include on a word-by-word basis. Just
because a word is linguistically related doesn't mean it returns the
results I want. Stemming is a black box which works against expert
searchers as often as it works for them. We feel the same about scoring
and ranking.
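
To make the contrast concrete, here is a minimal Python sketch. The
word list, the function names, and above all the toy suffix-stripping
"stemmer" are assumptions made up for illustration; a real stemmer
behaves differently, but the loss of word-by-word control is the same.

```python
from fnmatch import fnmatchcase

# Hypothetical vocabulary, chosen only for illustration.
WORDS = ["educate", "educated", "education", "educational",
         "educator", "educators"]


def wildcard_match(pattern, words):
    """Return the words matching a shell-style wildcard such as 'educator*'."""
    return [w for w in words if fnmatchcase(w, pattern)]


def toy_stem(word):
    """Deliberately naive stand-in for a stemmer: strip one known suffix."""
    for suffix in ("ational", "ation", "ors", "or", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word


def stem_match(query, words):
    """Return the words whose toy stem equals the toy stem of the query."""
    target = toy_stem(query)
    return [w for w in words if toy_stem(w) == target]


if __name__ == "__main__":
    print(wildcard_match("educator*", WORDS))
    print(stem_match("educators", WORDS))
```

With this toy data the wildcard returns only "educator" and
"educators", while the stemmed query also pulls in "educated", a word
the searcher never asked for and cannot exclude.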

Different users benefit from different tools. I wouldn't expect a
novice user to use wildcards or to be so annoyed with scoring and
ranking.

Which doesn't mean I wouldn't welcome the likes of a linguistic parser.
It would be a boon to all end users. Are you recommending we add a use
case which calls an implementation-defined linguistic parser (as we did
for stemming) or are you recommending more than that?

Pat


>>> Kai Großjohann <kai.grossjohann@uni-duisburg.de> 03/24/03 01:14PM
>>>

"Pat Case" <PCASE@crs.loc.gov> writes:

> [Pat Case: Please remember that a phrase query is a proximity query
> (ordered, allowing no intervening words).

IMHO this is an unnecessary limitation.

In the following I will say "linguistic phrase search" when talking
about the operation I mean, to avoid confusion with the FTS document.

> Also remember we are defining the functionalities which will be
> available for implementors. We don't expect most end users to define
> the parameters for a proximity query, but we do expect them to profit
> from proximity querying.

Proximity querying is a good thing, but linguistic phrase search is
even better :-)  -- at least for some common use cases.

(Just like wildcards are a good thing, but often, stemming is even
better.)

> I do not emphasize phrase querying because I think it is as
> dangerous as "or" querying is useless. I advise end users to do wider
> unordered proximity queries instead. In a system which supports
> phrase query I would build a More button that runs a wider unordered
> proximity query to pick up the missed results. My favorite example is
> in the internal system I work on for congressional documents. Folks
> search on "elementary education" and find very little. It is a
> reasonable query but it fails because congressional bills almost
> exclusively carry the phrase "elementary and secondary education".
> Allow a few intervening characters and hundreds of bills are
> returned.]
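
The "elementary education" example in the quoted paragraph can be
sketched in a few lines of Python. This is only an illustration under
stated assumptions: whitespace tokenization, a distance counted in
intervening words, and function names that are hypothetical and do not
come from the FTS document.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()


def phrase_match(tokens, first, second):
    """Ordered match that allows no intervening words."""
    return any(a == first and b == second
               for a, b in zip(tokens, tokens[1:]))


def unordered_proximity_match(tokens, first, second, max_between):
    """Match in either order with at most max_between words in between."""
    pos_first = [i for i, t in enumerate(tokens) if t == first]
    pos_second = [i for i, t in enumerate(tokens) if t == second]
    return any(i != j and abs(i - j) - 1 <= max_between
               for i in pos_first for j in pos_second)


if __name__ == "__main__":
    bill = tokenize("elementary and secondary education")
    print(phrase_match(bill, "elementary", "education"))
    print(unordered_proximity_match(bill, "elementary", "education", 2))
```

On the bill text the phrase query fails, while the unordered proximity
query with up to two intervening words succeeds.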

If you have a linguistic parser that recognizes that "elementary and
secondary education" is the same as "elementary education and
secondary education", then you don't need kludges such as proximity
search to correctly find the linguistic phrase "elementary education".

So by focussing on ordered/unordered proximity search with
various constraints, systems containing a linguistic parser are left
standing in the rain.

My suggestion is to imagine what the user wants to do,
semantically, and then define search predicates that capture this
idea.  Then it is left to the implementor (of the XQuery processing
engine) to decide how to do it.  Of course, it's possible to
recognize that linguistic parsers are not common in text retrieval
systems yet, and to include some other search predicates to
accommodate this fact.  But why let the standards definition be
influenced by what current systems do?

The FTS definition has successfully made the step from the syntactic
level (wildcards) to the semantic level (stemming) in one case --
normalization of derived forms.  I am suggesting making this very
same step in other cases as well: searching for (linguistic) noun
phrases (perhaps other kinds of phrases), searching for
similar-sounding words (Soundex is only one algorithm for this),
accommodating OCR errors or typos (the Damerau-Levenshtein metric
is one approach), searching for similar dates, searching for
similar-looking colors, and so on.
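
As one illustration of the typo/OCR case, here is a minimal Python
sketch of the restricted form of the Damerau-Levenshtein metric (often
called optimal string alignment), which counts insertions, deletions,
substitutions, and swaps of adjacent characters. The function names,
the sample words, and the threshold are assumptions for illustration
only; nothing here is prescribed by the FTS document.

```python
def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                # swap of two adjacent characters
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def fuzzy_match(query, words, max_distance=1):
    """Return the words within max_distance edits of the query."""
    return [w for w in words if osa_distance(query, w) <= max_distance]


if __name__ == "__main__":
    # Hypothetical index terms: one typo, one OCR error, one unrelated word.
    terms = ["eduaction", "educati0n", "elementary"]
    print(fuzzy_match("education", terms))
```

Here both the transposed "eduaction" and the OCR-style "educati0n" are
one edit away from "education", so a threshold of 1 finds them while
leaving "elementary" out.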

Note that the FTS document does not say how to implement stemming.  So
I don't expect it to be a problem that the implementation of linguistic
phrase search is similarly left unspecified.


I think that I'm having problems expressing the ideas in my mind.  I
hope you can understand them even though the expression is clumsy.
-- 
A preposition is not a good thing to end a sentence with.
Received on Monday, 24 March 2003 14:42:06 UTC