Re: FTS comments

"Pat Case" <PCASE@crs.loc.gov> writes:

> [Pat Case: Please remember that a phrase query is a proximity query
> (ordered, allowing no intervening words).

IMHO this is an unnecessary limitation.

In the following I will say "linguistic phrase search" when talking
about the operation I mean, to avoid confusion with the FTS document.

> Also  remember we are defining the functionalities which will be
> available for implementors. We don't expect most end users to define the
> parameters for a proximity query, but we do expect them to profit from
> proximity querying. 

Proximity querying is a good thing, but linguistic phrase search is
even better :-)  -- at least for some common use cases.

(Just like wildcards are a good thing, but often, stemming is even
better.)

> I do not emphasize phrase querying because I think it is as dangerous
> as "or" querying is useless. I advise end users to do wider unordered
> proximity queries instead. In a system which supports phrase query I
> would build a More button that runs a wider unordered proximity query to
> pick up the missed results. My favorite example is in the internal
> system I work on for congressional documents. Folks search on
> "elementary education" and find very little. It is a reasonable query
> but it fails because congressional bills almost exclusively carry the
> phrase "elementary and secondary education". Allow a few intervening
> characters and hundreds of bills are returned.]

If you have a linguistic parser that recognizes that "elementary and
secondary education" is the same as "elementary education and
secondary education", then you don't need kludges such as proximity
search to correctly find the linguistic phrase "elementary education".
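
To make the contrast concrete, here is a minimal sketch in Python.
It is only a toy: expand_coordination is a hypothetical stand-in for
a linguistic parser, and it blindly assumes that every "A and B N"
pattern is a pair of coordinated modifiers sharing one head noun.  A
real parser would decide this syntactically.  All names here are
made up for illustration and come from neither the FTS document nor
any existing system.

```python
import re


def expand_coordination(text):
    """Toy stand-in for a linguistic parser.

    Rewrites 'A and B N' as 'A N and B N'.  Assumption: the pattern is
    always a coordinated modifier pair with a shared head noun.
    """
    return re.sub(r"\b(\w+) and (\w+) (\w+)\b", r"\1 \3 and \2 \3", text)


def tokens(text):
    """Lowercased word tokens, punctuation dropped."""
    return re.findall(r"\w+", text.lower())


def phrase_match(query, text):
    """Strict phrase match: ordered, no intervening words."""
    q, t = tokens(query), tokens(text)
    return any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1))


bill = "A bill to improve elementary and secondary education."

# The strict phrase query misses the bill ...
plain_hit = phrase_match("elementary education", bill)

# ... but finds it once the coordination has been expanded.
parsed_hit = phrase_match("elementary education", expand_coordination(bill))
```

The point is that the query itself stays a plain phrase query; the
linguistic knowledge lives in the engine, not in proximity parameters
the user has to tune.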

So by focusing on ordered/unordered proximity search with various
constraints, the FTS document leaves systems that contain a
linguistic parser standing in the rain.

My suggestion is to imagine what the user wants to do,
semantically, and then define search predicates that capture this
idea.  Then it is left to the implementor (of the XQuery processing
engine) to decide how to do it.  Of course, it's possible to
recognize that linguistic parsers are not common in text retrieval
systems yet, and to include some other search predicates to
accommodate this fact.  But why let the standards definition be
influenced by what current systems do?

The FTS definition has successfully made the step from the syntactic
level (wildcards) to the semantic level (stemming) in one case --
normalization of derived forms.  I suggest taking this very
same step in other cases as well: searching for (linguistic) noun
phrases (perhaps other kinds of phrases), searching for
similar-sounding words (Soundex is only one algorithm for this),
accommodating OCR errors or typos (the Damerau-Levenshtein metric
is one approach), searching for similar dates, searching for
similar-looking colors, and so on.
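
As an illustration of the typo/OCR case, here is a small sketch in
Python.  It computes the optimal string alignment distance, which is
the restricted variant of the Damerau-Levenshtein metric (each
substring may be edited at most once).  The function names and the
max_distance threshold are my own assumptions, not anything the FTS
document defines.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein).

    Counts insertions, deletions, substitutions, and transpositions of
    two adjacent characters.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def fuzzy_match(query_word, doc_word, max_distance=1):
    """Hypothetical predicate: words match if they are within max_distance edits."""
    return osa_distance(query_word.lower(), doc_word.lower()) <= max_distance
```

With this, fuzzy_match("education", "eduaction") holds (one
transposition), and so does fuzzy_match("education", "educatlon")
(one substitution, a typical OCR confusion).  Again, whether an engine
uses this metric or something else would be left to the implementor.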

Note that the FTS document does not say how to implement stemming.  So
I don't expect that leaving the implementation of linguistic phrase
search similarly unspecified would be a problem.


I think I'm having trouble expressing the ideas in my mind.  I
hope you can understand them even though the expression is clumsy.
-- 
A preposition is not a good thing to end a sentence with.

Received on Monday, 24 March 2003 13:15:17 UTC