fulltext search in versioned content

Hi --

Let me start with some background on what I'm interested in. I am hoping
for comments on whether this idea might have some merit in the context
of DASL. General comments are also much appreciated :)

I am considering research into a "version-aware inverted-file full-text
indexing algorithm". The result would be an index optimized for searching
a collection of versioned documents. By 'optimized' I mean (1) taking
advantage of similarities across revisions to reduce index size, and (2)
improving search speed for large collections.

For a document that exists in 10 revisions, a "traditional" full-text
index (as in Lucene or Greenstone, for example) would index 10 separate
documents and attach versioning metadata to each. A version-aware
approach would instead rest on the premise that there is one 'base'
document plus a delta for each revision. If a revision diverges
substantially from its base, a new base can be established.
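
To make the base-plus-delta idea concrete, here is a minimal sketch in
Python of what I have in mind. Everything in it (the class name, the
boolean-only postings, the way deltas are stored) is illustrative and not
an established design; a real index would obviously be more involved.

from collections import defaultdict

def tokenize(text):
    # Crude tokenizer for illustration: lowercase, split on whitespace.
    return set(text.lower().split())

class VersionAwareIndex:
    def __init__(self):
        # term -> set of doc_ids whose *base* revision contains the term
        self.base_postings = defaultdict(set)
        # (doc_id, revision) -> (terms added, terms removed) relative to base
        self.deltas = {}
        # doc_id -> term set of the base revision
        self.base_terms = {}

    def add_base(self, doc_id, text):
        terms = tokenize(text)
        self.base_terms[doc_id] = terms
        for t in terms:
            self.base_postings[t].add(doc_id)

    def add_revision(self, doc_id, revision, text):
        terms = tokenize(text)
        base = self.base_terms[doc_id]
        # Only the difference against the base is stored, not the full text.
        self.deltas[(doc_id, revision)] = (terms - base, base - terms)

    def search(self, term):
        """Return (doc_id, revision) pairs whose text contains `term`."""
        term = term.lower()
        hits = set()
        for (doc_id, revision), (added, removed) in self.deltas.items():
            in_base = doc_id in self.base_postings.get(term, set())
            if (in_base and term not in removed) or term in added:
                hits.add((doc_id, revision))
        return hits

# Example:
#   idx = VersionAwareIndex()
#   idx.add_base("spec.txt", "full text search over versioned documents")
#   idx.add_revision("spec.txt", 2, "full text search over versioned WebDAV resources")
#   idx.search("webdav")     ->  {("spec.txt", 2)}
#   idx.search("documents")  ->  set(), because revision 2 dropped the term

A real implementation would store term positions (needed for proximity
matching) and would not scan every delta at query time; the sketch is only
meant to show how revision postings can be reduced to differences against
a base.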

In terms of DASL, this may be of use to implementations offering full-text
content search. I would expect an inverted index to perform better than
the database-driven approach described in the Catacomb paper, if I
understand that approach correctly. It would also support features such
as proximity matching.
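
For concreteness, here is roughly what such a query looks like on the
wire, using the DAV:basicsearch grammar with DAV:contains from the DASL
draft. The host name and collection path are made up, and I'm just using
Python's http.client to send the SEARCH method; this is a sketch, not a
tested client.

import http.client

BODY = """<?xml version="1.0" encoding="utf-8"?>
<D:searchrequest xmlns:D="DAV:">
  <D:basicsearch>
    <D:select><D:prop><D:displayname/></D:prop></D:select>
    <D:from>
      <D:scope><D:href>/docs/</D:href><D:depth>infinity</D:depth></D:scope>
    </D:from>
    <D:where><D:contains>inverted index</D:contains></D:where>
  </D:basicsearch>
</D:searchrequest>"""

conn = http.client.HTTPConnection("dav.example.org")  # placeholder host
conn.request("SEARCH", "/docs/", body=BODY.encode("utf-8"),
             headers={"Content-Type": 'text/xml; charset="utf-8"'})
resp = conn.getresponse()
print(resp.status, resp.reason)  # a DASL server would answer 207 Multi-Status

The version-aware index would be what answers the DAV:contains clause on
the server side.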

I would welcome feedback on the perceived merits, or lack thereof, of this
idea for DASL implementations.

Is version-aware searching a feature in demand? I haven't seen much of it
in mainstream applications.

cheers
Gerret

Received on Wednesday, 25 February 2004 00:24:04 UTC