Re: FTS comments

On Mon, Mar 24, 2003 at 07:14:34PM +0100, Kai Großjohann wrote:
> Proximity querying is a good thing, but linguistic phrase search is
> even better :-)  -- at least for some common use cases.

To some extent this has to be a quality-of-implementation issue.

We want to end up with a specification that everyone will implement
completely and interoperably, and that will work for all languages,
without intellectual property rights issues on the algorithms.

> (Just like wildcards are a good thing, but often, stemming is even
> better.)

Wildcards and stemming provide different (but overlapping) functions.  
You can use run* to find running (but not ran); you can't use stemming
to find both Cardinality and Card unless the stemming is overly
aggressive (and then a search for cardinals would find gaming houses).

More formally, wildcards are often useful precisely because they do
not operate on linguistic principles, but are purely lexical.
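
As a rough illustration of the difference, here is a minimal Python
sketch.  The word list, the toy_stem() helper and both suffix tables
are hypothetical, made up for this example; they are not taken from
the FTS draft or from any real stemmer.

```python
from fnmatch import fnmatchcase

# Hypothetical sample vocabulary, lower-cased for simplicity.
WORDS = ["run", "running", "runs", "ran", "card", "cardinality", "cardinals"]

# Two made-up suffix tables: a cautious one and an overly aggressive one.
CONSERVATIVE = ("ning", "ing", "s")
AGGRESSIVE = ("inality", "inals", "inal") + CONSERVATIVE


def wildcard_match(pattern, words):
    """Purely lexical matching: '*' stands for any run of characters."""
    return [w for w in words if fnmatchcase(w, pattern)]


def toy_stem(word, suffixes):
    """Strip the first matching suffix, keeping at least three letters.

    This is a deliberately crude stand-in for a real stemmer.
    """
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_match(query, words, suffixes):
    """Match every word that shares a stem with the query word."""
    target = toy_stem(query, suffixes)
    return [w for w in words if toy_stem(w, suffixes) == target]


if __name__ == "__main__":
    print(wildcard_match("run*", WORDS))
    print(stem_match("cardinals", WORDS, CONSERVATIVE))
    print(stem_match("cardinals", WORDS, AGGRESSIVE))
```

With these tables, run* matches run, running and runs but not ran.
The cautious table keeps cardinals apart from card, while the
aggressive one reduces cardinals, cardinality and card to the same
stem, which is exactly the false match described above.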

> So by focussing on the ordered/unordered proximity search with
> various constraints, system containing a linguistic parser are left
> standing in the rain.

I can imagine a requirement for linguistic analysis delaying
publication of a specification by many years, while research is done
to ensure that royalty-free, public algorithms exist that are
sufficient for interoperability.

Conformance testing means that the same query must generally provide
the same results on all systems, when run on the same data, at least
in the default mode.  Otherwise it's very hard to test if the system is
functioning correctly.

So that implies to me that at the very least a base conformance level
would need to use only lexical, not linguistic, techniques.
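
To make that concrete, here is a minimal Python sketch of a purely
lexical proximity test, together with the kind of table-driven check a
conformance suite might run.  The tokenize(), near() and
failing_cases() names, the \w+ tokenization rule and the test cases
are all assumptions made for this example; none of them comes from the
FTS draft, and a real specification would have to define tokenization
far more carefully.

```python
import re


def tokenize(text):
    """Assumed tokenization: lower-cased runs of word characters."""
    return re.findall(r"\w+", text.lower())


def near(text, first, second, max_distance, ordered=False):
    """True if the two terms occur within max_distance token positions.

    With ordered=True, `first` must appear before `second`.
    Only token identity and position are used, no linguistic analysis.
    """
    tokens = tokenize(text)
    first_positions = [i for i, t in enumerate(tokens) if t == first.lower()]
    second_positions = [i for i, t in enumerate(tokens) if t == second.lower()]
    for a in first_positions:
        for b in second_positions:
            gap = b - a
            if ordered and gap <= 0:
                continue
            if 0 < abs(gap) <= max_distance:
                return True
    return False


# Hypothetical conformance cases: (text, query arguments, expected result).
CASES = [
    ("Full text search for XML", ("full", "search", 2, False), True),
    ("Full text search for XML", ("search", "full", 2, True), False),
    ("Full text search for XML", ("full", "xml", 2, False), False),
]


def failing_cases(engine, cases):
    """Return the cases where an implementation disagrees with the table."""
    return [c for c in cases if engine(c[0], *c[1]) != c[2]]


if __name__ == "__main__":
    print(failing_cases(near, CASES))
```

Because the matching depends only on token identity and position, any
two implementations that agree on tokenization should pass the same
table, which is what makes a lexical baseline straightforward to test.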

> Note that the FTS document does not say how to implement stemming.

It may be necessary to do so eventually, to a degree sufficient for
interoperability, but I will defer here to any review by the W3C
Internationalization Working Group, which I expect is very likely
before a final full text search specification could be published by
the W3C.

I think the challenge will be to come up with a spec small enough
and simple enough that it can be reviewed, understood, implemented
and tested, yet large enough to be useful, and extensible enough that
the sort of features you describe can be added in later versions, or
by individual implementations.  And to do so in under five years :-)

Or that's my take at least, but I hope it's useful to you.

Liam

-- 
Liam Quin, W3C XML Activity Lead, liam@w3.org, http://www.w3.org/People/Quin/
http://www.holoweb.net/~liam/

Received on Thursday, 27 March 2003 15:41:46 UTC