Re: fulltext search in versioned content from Edward C. Zimmermann on 2004-02-25 (www-webdav-dasl@w3.org from January to March 2004)

From: Edward C. Zimmermann <edz@elmyra.bsn.com>
Date: Wed, 25 Feb 2004 09:19:20 +0100 (MET)
To: ga11@cs.waikato.ac.nz
Cc: www-webdav-dasl@w3.org
Message-Id: <200402250819.i1P8JK412784@elmyra.bsn.com>

>
>Hi --
>
>let me start with some background on what I'm interested in. I am hoping
>for comments on whether my interest might have some merit in the context
>of DASL. General comments also much appreciated :)
>
>I am considering research into a "version-aware inverted-file full-text
>indexing algorithm". The result would be an index optimized for searching

A few points:

- Given the current ratio of storage cost to I/O and processing costs, index
  size is no longer the issue it was a decade ago. While collections of many
  GBs and millions of records (beyond the task of crawling and replicating
  "Internet Web pages") are not pedestrian, sufficient storage is available on
  nearly any personal computer sold in supermarkets today (these, I think,
  typically have no less than 80 GB this week).
  
  The limiting factor continues to be I/O and especially the capacities of
  32-bit kernel memory management of the operating systems that still are
  dominant.

- Small embedded systems might not have the storage, but I have trouble
  thinking of an S/R application on an embedded system that would demand more
  than what these things already seem to have. We might not be able to fit all
  the NIH human genome records or USPTO's patents, but I don't see why one
  would need to (and for these we have more conventional computers).

- Inverted-file algorithms are not terribly good at handling large amounts
  of data and, more importantly, at handling fields and structure.


If you are interested in indexing only context diffs but in searching the
entirety of the rendered document, one would need an additional kind of
diff between that document and a fixed reference point, and would search via
de-referencing to a complex document that contains both "fragments".
During presentation one would then have to reconstruct the document. This
strikes me as more costly than just indexing the documents and handling the
versioning on another layer.
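
To make the reconstruction step concrete, here is a minimal sketch in Python.
It assumes line-oriented documents and uses the standard difflib module as a
stand-in for whatever diff format one would really index; the sample text and
the helper names render() and search() are hypothetical. The delta is the
"complex document" holding fragments of both versions, and every search of a
rendered version has to rebuild that version first.

```python
import difflib

# Hypothetical sample data: a fixed reference point and a later version.
reference = [
    "WebDAV supports versioning.",
    "Search is defined by DASL.",
]
revised = [
    "WebDAV supports versioning.",
    "Full-text search is defined by DASL.",
    "Indexes may be version-aware.",
]

# The delta is one combined document that carries the fragments of both sides.
delta = list(difflib.ndiff(reference, revised))


def render(delta, which):
    """Rebuild version 1 (reference) or version 2 (revised) from the delta."""
    return list(difflib.restore(delta, which))


def search(delta, which, term):
    """Search a rendered version; note the reconstruction on every call."""
    needle = term.lower()
    return [line for line in render(delta, which) if needle in line.lower()]


hits_old = search(delta, 1, "full-text")
hits_new = search(delta, 2, "full-text")
```

The extra render() pass on each query is the cost meant above; indexing the
rendered documents directly avoids it at the price of a larger index.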

Our "fulltext engine", in fact, contains a versioning facility. We tend to
use very straightforward approaches, such as handling all the versions as
fully rendered documents but with the system aware of the versioning and
what to do with it. This tends to depend upon what we need or want to do
(application/project specific).
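
As a rough illustration only (this is not our engine's code, and the class
and method names are made up for the example), the idea of indexing every
version as a fully rendered document with a thin versioning layer on top
could be sketched in Python like this. A toy term-to-postings map stands in
for the real index, and "latest version wins" is just one possible policy.

```python
import re
from collections import defaultdict


class VersionedIndex:
    """Toy index: each version is indexed whole; versioning is a separate layer."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> {(doc_id, version), ...}
        self.latest = {}                  # doc_id -> highest version seen

    def add(self, doc_id, version, text):
        """Index one fully rendered version of a document."""
        for term in re.findall(r"\w+", text.lower()):
            self.postings[term].add((doc_id, version))
        self.latest[doc_id] = max(version, self.latest.get(doc_id, version))

    def search(self, term, latest_only=True):
        """Return sorted (doc_id, version) hits, optionally only current versions."""
        hits = self.postings.get(term.lower(), set())
        if latest_only:
            hits = {(d, v) for (d, v) in hits if self.latest[d] == v}
        return sorted(hits)


# Hypothetical usage with two versions of one document.
idx = VersionedIndex()
idx.add("spec", 1, "DASL search draft")
idx.add("spec", 2, "DASL full text search draft")

current_hits = idx.search("draft")
all_hits = idx.search("draft", latest_only=False)
```

What the versioning layer does with the hits (latest only, all versions, a
given date) is exactly the application-specific part.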


______________________
Edward C. Zimmermann, Basis Systeme netzwerk, Munich
Leopoldstrasse 53-55, D-80802 Munich, Federal Republic of Germany
<http://www.stadtplandienst.de/query;ORT=m;PLZ=80802;STR=Leopoldstr%2E;HNR=53;GR=2;PRINTER_FRIENDLY=TRUE>
Telephone:   Voice:= +49 (89) 385-47074  Fax:= +49 (89)  692-8150
          Cellular:= +49 (179) 205-0539

Received on Wednesday, 25 February 2004 03:19:39 UTC