DASL Issues for Content Based Retrieval

By request, I am resending part of my e-mail (which
addressed a different subject) -- the part about
some of the issues involved in content based
retrieval. I'm doing this so that e-mail threads can
remain focused on one subject, and so that the
subject line can be more useful:

- - - - - - - - - - - - - - - - - - - - - - - - -

The "contains" operator is where
we step off of the well beaten path, and
where I suspect will have to spend a
lot of effort. The problem is that
everyone understands AND, OR, and NOT
and the six relational operators (and
the usual arithmetic operators), but
there are no standards for content based
retrieval operators, or for relevancy
ranking, or for hit highlighting.

Here are some of the issues for content
based retrieval hidden behind the
simple-looking "contains" operator:

Search functionality includes substring matching
(case sensitive, case insensitive,
and wildcards); stemming; nearness of
words; exact phrases (e.g., "The Who");
conjunctions, disjunctions, and exclusion of 
subgroups of full text search conditions; exclusion of
stop words; synonyms; mapping words into
concepts instead of just looking up
vocabulary that occurs in the document; etc.
How is all this functionality
specified? Which parts of it are optional?
How do we accommodate most of the important
full text engine vendors, or can we even do that?

One obvious approach is to have
the "contains" operator take a string
parameter. The string parameter would use a
syntax that specifies the search functionality.
Let me give an example to convey the flavor of
what I mean. However, please bear in mind that this
example is in no way a proposal for a specific
string syntax. For this example, assume the string
parameter is "(computer$ # memory) <AND> gigabyte*".
This string means the document must contain
the word "computer" with stemming 
performed on it. Stemming is indicated by the 
dollar sign. The "#" means that the stemmed
word "computer" must occur "near" the word
memory, unstemmed. In addition to this
condition (the parentheses indicate grouping),
the document must also contain a word that
begins with the letters "gigabyte", with the
case matching exactly.

Before the search can be performed, the content based
retrieval index to use must be specified.
Some vendors, e.g., Oracle, integrate a
content based search engine with their RDBMS.
Mostly, however, search engines
search a full text index that is independent
of the document management system (or the
set of files) it indexes. In fact, the
full text index may catalog the documents
in multiple repositories, and it may catalog
multiple collections of files as well. In
other words, the full text index can be separate
from, and independent of, the WebDAV collection(s)
it indexes. To complicate matters even further,
sometimes you want to specify more than one
full text index to be used in a search.
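
For illustration only, a query might carry an explicit list of full
text indexes to consult, with the server fanning the search out and
merging the results. A minimal sketch; the index names and the
search_index() call are hypothetical, not anything DASL defines:

    # Hypothetical sketch: fan a search out over several full text
    # indexes and merge the resulting resource URLs.
    def search_indexes(index_names, contains_expr, search_index):
        hits = set()
        for name in index_names:
            # search_index() stands in for whatever engine-specific
            # call queries one full text index and returns URLs.
            hits.update(search_index(name, contains_expr))
        return sorted(hits)

    # Example: one index per repository.
    # urls = search_indexes(["eng-index", "mkt-index"],
    #                       "gigabyte*", my_engine)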

Variants (DMA calls these alternate
renditions) are another potential issue.
Maybe you just submit the
variant(s) you choose to the full text
indexer, and the full text search simply
returns the URL of the resource. Then, the
query could retrieve whatever variant(s)
you wish by whatever normal mechanism we
specify for DASL queries, not necessarily
only the variant(s) that was (were) indexed.
(For example, consider an image document
and its OCR text. You
can't put the image document in the full
text index -- you put the OCR text in
there. However, when you do a retrieval,
my guess is that you usually want to
see the image document. But then, you
may not get hit highlighting on the image.)

Integrated full text retrieval engines
might have no problems, other than possibly
performance, in mixing full text conditions
with conditions on "hard indexes",
i.e., properties like "loan_number" that
you put in the RDBMS catalog of your document
management system. 

However, there may be restrictions on how 
conditions on hard properties can be combined
with full text conditions. For example,
when DMA demonstrated full text query in trial
use, the demo followed the convention
that the top level operator in the parse
tree must be "AND", that one of its operands
contained only conditions on hard properties,
and that its other operand contained only full
text search conditions. The DMS used was
separate from the full text catalog,
so it was necessary to drive the search
from one or the other. In other words,
you either (1) perform the query on
the DMS catalog and, for every hit,
test the content against the full text
condition, or (2) perform the query
on the full text catalog and then, for each
document number (in DASL, the resource
URL) returned, ask the DMS whether the hard
properties satisfy the hard property
condition. The catalog that returns
the fewest results should be the one used
to drive the search; the problem is knowing
which one that is. Some full text engines allow
you to put hard properties in the
full text index, so for these engines,
you might be able to just submit
the query to the full text catalog.
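
Here is a rough sketch of the two driving strategies, assuming simple
callable interfaces to the DMS catalog and the full text engine. All
of the function names (dms_query, fulltext_query, fetch_content,
content_matches, hard_properties_match) are hypothetical stand-ins:

    # Illustrative sketch of the two ways to drive a mixed query when
    # the DMS catalog and the full text index are separate systems.

    def drive_from_dms(dms_query, hard_cond,
                       fetch_content, content_matches, ft_cond):
        # (1) Evaluate the hard-property condition in the DMS catalog
        # first, then test each hit's content against the full text
        # condition.
        results = []
        for url in dms_query(hard_cond):
            if content_matches(fetch_content(url), ft_cond):
                results.append(url)
        return results

    def drive_from_fulltext(fulltext_query, ft_cond,
                            hard_properties_match, hard_cond):
        # (2) Evaluate the full text condition first, then ask the DMS
        # whether each returned resource satisfies the hard-property
        # condition.
        results = []
        for url in fulltext_query(ft_cond):
            if hard_properties_match(url, hard_cond):
                results.append(url)
        return results

    # Whichever side is expected to return fewer hits should drive
    # the search; the difficulty is predicting which side that is.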

There is no standard for relevancy
ranking. Every engine does it its own
way. Furthermore, the datatype (integer
or floating point) varies. There are
two types of relevancy ranking: (1) Relevancy
ranking wherein the relevancy ranks
depend upon the particular documents
in the corpus. (2) Relevancy ranking
where the results do not depend upon
the corpus. The relevancy rank is
relative to your query, and can be
computed using only the document and
the query. Relevancy ranking of type
(1) does not scale to the enterprise,
and does not lend itself to querying
across repositories, because the
relevancy ranks from different collections
are not comparable. The second approach
does not have this problem, but not
all full text engines do it that way.
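
To illustrate the distinction (this is not any vendor's actual
formula): a corpus dependent rank such as a TF-IDF style score needs
document frequencies gathered over the whole collection, while a
corpus independent rank can be computed from the query and the single
document alone. A toy sketch:

    # Illustrative only: two toy relevancy ranks showing the
    # difference between type (1) and type (2).
    import math

    def rank_corpus_independent(query_terms, doc_terms):
        # Type (2): depends only on the query and this one document,
        # e.g. the fraction of query terms that occur in the document.
        doc = set(doc_terms)
        return sum(1 for t in query_terms if t in doc) / len(query_terms)

    def rank_corpus_dependent(query_terms, doc_terms, doc_freq, num_docs):
        # Type (1): a TF-IDF style score; doc_freq and num_docs
        # describe the whole corpus, so ranks computed against
        # different collections are not comparable.
        score = 0.0
        for t in query_terms:
            tf = doc_terms.count(t)
            idf = math.log(num_docs / (1 + doc_freq.get(t, 0)))
            score += tf * idf
        return score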

There are no standards for hit highlighting
either. Again, each engine does it its
own way. Some engines don't do it at
all. Hit highlighting is very problematic
on some types of documents. For example,
for HTML, you might embed tags. For Microsoft Word
documents, the format changes from release
to release, and may not be documented, since
it is a proprietary format. If you don't
get hit highlighting information back,
you have to do the full text search
again on the client to perform hit highlighting.
This may be sort of OK, because if you 
add up the CPU power in the network, most
of it lives on the client. However, you 
probably need a plug-in for each document
format, which makes for a fat client.
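
If no hit information comes back, the client can redo the matching
itself on a text rendition of the document, for example by wrapping
matched words in markup. A crude sketch, assuming the client already
has plain text to work with (the real cost is the per-format plug-in
that produces that text):

    # Crude illustration of client side hit highlighting on plain
    # text: wrap each query term in <b>...</b>.
    import re

    def highlight(text, terms):
        for term in terms:
            pattern = re.compile(re.escape(term), re.IGNORECASE)
            text = pattern.sub(lambda m: "<b>" + m.group(0) + "</b>", text)
        return text

    # highlight("A gigabyte of computer memory", ["computer", "memory"])
    # -> 'A gigabyte of <b>computer</b> <b>memory</b>'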

Alan Babich
