"contains" must be optional

The "contains" CBR operator as currently specified 
has a few problems. One problem is that it is a required 
operator. Requiring it raises the barrier of entry 
way too high for many existing systems. 
(See "Barrier of entry must be low", 6/30/98.)

Point 1: 

One system I am familiar with is a high volume document imaging 
system. Billions of dollars' worth of this system have been
sold, and it is still going strong. The documents are scanned in.
There is no text version of the image documents.
There is no possibility of content based retrieval unless
there is a text based version of the documents. The system
has been shipping for eleven years, and there is so little
demand for OCR of these documents that OCR and text based
retrieval are still not provided. Content based retrieval
is obviously not a significant benefit for applications
using this system.


Point 2:

I believe the situation is the same for other high volume
imaging systems.


Point 3:

Another system I am familiar with manages electronic
documents, i.e., documents containing text, typed by humans,
that software understands. This system is one of the top 
few in market share. Content based retrieval is provided with 
this system as an installable option. However, most customers 
of this system do not install that option. Of those that do 
install it, most don't find it useful. Fewer than 10% of the 
customers of this system use 
content based retrieval. My supposition is that the state
of the art of content based retrieval is not yet good enough
for the other 90% of the customers.


Point 4:

I believe that the situation is the same for other EDM systems.


Point 5:

Consider what it would take for a system to offer content
based retrieval if it is not already doing so: Implementing
a CBR engine would be a lot of work. Considering the maturity
of the CBR industry, implementing another CBR engine at this
point in time would be a very poor business decision for 
almost all companies. Therefore, the company involved would 
probably look to an existing CBR company to provide this
functionality. The first step is to evaluate the candidates.
This takes time, money, and effort. The second step is to 
strike a deal with at least one of them. The third step is
to integrate with the CBR software. There would have to be
training and education for engineering, documentation, 
marketing, and support. There are also ongoing costs for
support and release upgrades.

Point 6:

We must be inclusive. (See "Barrier of entry must be low", 
6/30/98.) Excluding whole classes of such systems from DASL 
would be a mistake.


Point 7:

On the other hand, we should not drop full text retrieval
functionality out of DASL for 1.0, because it is clearly
extremely useful for some applications. This is true
whether you believe in having a single poorly defined 
operator like "contains", a set of crisply defined 
CBR operators, or both.


Conclusion:

By now you've figured it out. If the system isn't already
supporting CBR, the costs outweigh the benefits, and such
a system cannot implement it just for DASL. In other words,
making "contains" required raises the barrier of entry
way too high. Therefore, all CBR operators must be optional.

(To be clear: I do NOT include string pattern match operators, 
e.g., the SQL LIKE operator, in the CBR operator category. 
String pattern match operators are a separate discussion 
altogether. I will discuss these in a separate e-mail.)


Immediate Consequence:

An immediate consequence of all CBR operators being optional 
is that the query operators supported by a collection must be 
advertised by the collection. DASL can NOT take the position
that all the operators that can and must be implemented are 
in the DASL 1.0 spec., so they don't have to be advertised.
Our charter requires us to advertise query capabilities,
so we shouldn't have any reservations about advertising
the query operators supported. In any case, we will have
to do it in later versions of the spec. when additional 
optional query operators are introduced.
It's really easy to do.
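
As a rough illustration of how little is involved, here is a minimal
sketch in Python. Nothing in it comes from any DASL draft: the
Collection class, the list of required operators, and the collection
names are all made up for this example. It only shows the idea that a
collection advertises its operators and a client checks the list
before sending a query that uses "contains".

```python
# Hypothetical sketch: the operator names and the Collection class are
# illustrative assumptions, not taken from any DASL draft.

# Operators assumed (for this example only) to be required of everyone.
REQUIRED_OPERATORS = frozenset({"and", "or", "not", "eq", "lt", "gt"})


class Collection:
    """A searchable collection that advertises its query operators."""

    def __init__(self, name, optional_operators=()):
        self.name = name
        self.optional_operators = frozenset(optional_operators)

    def advertised_operators(self):
        """Required operators plus whatever optional ones are supported."""
        return REQUIRED_OPERATORS | self.optional_operators


def unsupported_operators(collection, query_operators):
    """Return, sorted, the query operators the collection does not advertise."""
    return sorted(set(query_operators) - collection.advertised_operators())


# An imaging system with no CBR, and an EDM system that offers "contains".
imaging = Collection("imaging-archive")
edm = Collection("edm-library", optional_operators=["contains"])

query_ops = ["and", "eq", "contains"]
for coll in (imaging, edm):
    missing = unsupported_operators(coll, query_ops)
    if missing:
        print(f"{coll.name}: cannot run query, unsupported: {missing}")
    else:
        print(f"{coll.name}: query is supported")
```

With this arrangement the imaging system can participate without
implementing CBR at all, and a client learns up front that "contains"
is unavailable there instead of finding out from a failed query.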


Alan Babich

Received on Tuesday, 30 June 1998 18:18:53 UTC