- From: Babich, Alan <ABabich@filenet.com>
- Date: Mon, 4 May 1998 20:00:05 -0700
- To: "'www-webdav-dasl@w3.org'" <www-webdav-dasl@w3.org>
By request, I am resending part of my e-mail (which addressed a different subject) -- the part about some of the issues involved in content based retrieval. I'm doing this so that e-mail threads can remain focused on one subject, and so that the subject line can therefore be more useful:

- - - - - - - - - - - - - - - - - - - - - - - - -

The "contains" operator is where we step off of the well beaten path, and where I suspect we will have to spend a lot of effort. The problem is that everyone understands AND, OR, and NOT and the six relational operators (and the usual arithmetic operators), but there are no standards for content based retrieval operators, or for relevancy ranking, or for hit highlighting.

Here are some of the issues for content based retrieval hidden behind the simple looking "contains" operator:

Search functionality includes substring matching (case sensitive, case insensitive, and wildcards); stemming; nearness of words; exact phrases (e.g., "The Who"); conjunctions, disjunctions, and exclusion of subgroups of full text search conditions; exclusion of stop words; synonyms; mapping words into concepts instead of just looking up vocabulary that occurs in the document; etc. How is all this functionality specified? Which part of this functionality is optional? How do we accommodate most of the important full text engine vendors, or can we even do that?

One obvious approach is to have the "contains" operator take a string parameter. The string parameter would use a syntax that specifies the search functionality. Let me give an example to give the flavor of what I mean. However, please bear in mind that this example is in no way a proposal for a specific string syntax. For this example, assume the string parameter is "(computer$ # memory) <AND> gigabyte*". This string means the document must contain the word "computer" with stemming performed on it. Stemming is indicated by the dollar sign.
The "#" means that the stemmed word "computer" must occur "near" the word "memory", unstemmed. In addition to this condition (the parentheses indicate grouping), the document must also contain a word that begins with the letters "gigabyte", with the case matching exactly.

Preliminary to the search, the content based retrieval index to use must be specified. There are vendors, e.g., Oracle, that have a content based search engine integrated with their RDBMS. However, most search engines out there search a full text index that is independent of the document management system (or the set of files) it indexes. In fact, the full text index may catalog the documents in multiple repositories, and it may catalog multiple collections of files as well. In other words, the full text index can be separate from and independent of the WebDAV collection(s) it indexes. To complicate matters even further, sometimes you want to specify more than one full text index to be used in a search.

Variants (DMA calls these alternate renditions) are another potential issue. Maybe you just submit the variant(s) you choose to the full text indexer, and the full text search simply returns the URL of the resource. Then the query could retrieve whatever variant(s) you wish by whatever normal mechanism we specify for DASL queries, not necessarily only the variant(s) that was (were) indexed. (For example, consider an image document and its OCR text. You can't put the image document in the full text index -- you put the OCR text in there. However, when you do a retrieval, my guess is that you usually want to see the image document. But then, you may not get hit highlighting on the image.)

Integrated full text retrieval engines might not have any problems other than, possibly, performance, in mixing full text conditions with conditions on "hard indexes", i.e., properties like "loan_number" that you put in the RDBMS catalog of your document management system.
However, there may be restrictions on how conditions on hard properties can be combined with full text conditions. For example, when DMA demonstrated full text query in trial use, for the demo they followed the convention that the top level operator in the parse tree must be "AND", with one of its operands containing only conditions on hard properties and the other containing only full text search conditions. The DMS used was separate from the full text catalog, so it was necessary to drive the search from one or the other. In other words, you either (1) perform the query on the DMS catalog and, for every hit, test the content against the full text condition, or (2) perform the query on the full text catalog and, for each document number (in DASL, resource URL) returned, ask the DMS whether the hard properties satisfy the hard properties condition. The catalog that returns the fewest results should be the one used to drive the search. However, the problem is knowing which one that is. Some full text engines allow you to put hard properties in the full text index, so for these engines, you might be able to just submit the query to the full text catalog.

There is no standard for relevancy ranking. Every engine does it its own way. Furthermore, the datatype (integer or floating point) varies. There are two types of relevancy ranking:

(1) Relevancy ranking wherein the relevancy ranks depend upon the particular documents in the corpus.

(2) Relevancy ranking where the results do not depend upon the corpus. The relevancy rank is relative to your query, and can be computed using only the document and the query.

Relevancy ranking of type (1) does not scale to the enterprise, and does not lend itself to querying across repositories, because the relevancy ranks from different collections are not comparable. The second approach does not have this problem, but not all full text engines do it that way.

There are no standards for hit highlighting either.
Again, each engine does it its own way. Some engines don't do it at all. Hit highlighting is very problematic on some types of documents. For example, for HTML, you might embed tags. For Microsoft Word documents, the format changes from release to release, and may not be documented, since it is a proprietary format. If you don't get hit highlighting information back, you have to do the full text search again on the client to perform hit highlighting. This may be sort of OK, because if you add up the CPU power in the network, most of it lives on the client. However, you probably need a plug-in for each document format, which makes for a fat client.

Alan Babich
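
To make the two ways of driving a mixed query concrete, here is a minimal sketch in Python. It is not taken from DMA or from any product. The dictionaries standing in for the DMS catalog and the full text catalog, the function names, and the hit-count estimates passed to `search` are all assumptions made for illustration. In particular, the sketch simply assumes the estimates are available, which is exactly the part the message identifies as the hard problem.

```python
from typing import Callable, Dict, Set

# Hypothetical stand-ins for the two separate catalogs:
#   DMS catalog:       resource URL -> hard properties
#   full text catalog: resource URL -> indexed text
DmsCatalog = Dict[str, Dict[str, object]]
TextCatalog = Dict[str, str]
HardCond = Callable[[Dict[str, object]], bool]
TextCond = Callable[[str], bool]


def drive_from_dms(dms: DmsCatalog, text: TextCatalog,
                   hard_cond: HardCond, text_cond: TextCond) -> Set[str]:
    """Strategy (1): query the DMS catalog, then test each hit's content."""
    hits = [url for url, props in dms.items() if hard_cond(props)]
    return {url for url in hits if url in text and text_cond(text[url])}


def drive_from_full_text(dms: DmsCatalog, text: TextCatalog,
                         hard_cond: HardCond, text_cond: TextCond) -> Set[str]:
    """Strategy (2): query the full text catalog, then ask the DMS about each hit."""
    hits = [url for url, content in text.items() if text_cond(content)]
    return {url for url in hits if url in dms and hard_cond(dms[url])}


def search(dms: DmsCatalog, text: TextCatalog,
           hard_cond: HardCond, text_cond: TextCond,
           est_dms_hits: int, est_text_hits: int) -> Set[str]:
    """Top-level AND: drive from whichever catalog is expected to return fewer hits.

    The two estimates are assumed inputs; obtaining them is the open problem.
    """
    if est_dms_hits <= est_text_hits:
        return drive_from_dms(dms, text, hard_cond, text_cond)
    return drive_from_full_text(dms, text, hard_cond, text_cond)


# Toy data (made up for this sketch).
dms = {
    "/docs/a": {"loan_number": 1001},
    "/docs/b": {"loan_number": 2002},
    "/docs/c": {"loan_number": 1001},
}
text = {
    "/docs/a": "computer memory gigabytes",
    "/docs/b": "computer memory",
    "/docs/c": "tape drive",
}


def hard(props: Dict[str, object]) -> bool:
    return props.get("loan_number") == 1001


def full(content: str) -> bool:
    return "memory" in content


result = search(dms, text, hard, full, est_dms_hits=2, est_text_hits=2)
```

Both strategies return the same set of resource URLs. They differ only in how many per-hit checks have to be made against the other catalog, which is why picking the catalog with the fewest results matters.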
Received on Monday, 4 May 1998 23:02:21 UTC