- From: Babich, Alan <ABabich@filenet.com>
- Date: Mon, 4 May 1998 13:10:00 -0700
- To: "'Judith Slein'" <slein@wrc.xerox.com>
- Cc: "'Saveen Reddy (Exchange)'" <saveenr@Exchange.Microsoft.com>, "'www-webdav-dasl@w3.org'" <www-webdav-dasl@w3.org>, "Babich, Alan" <ABabich@felix.filenet.com>, sembower@wrc.xerox.com
Judith: I think it is safe to assume that we all value simplicity highly. Simplicity is the reason I made the proposal that I did. I believe it is very unsimple to start out with the going-in assumption of producing multiple query grammars for DASL. If we unnecessarily produce multiple query grammars, that would be a disservice to the industry. (The operative word is "unnecessarily".) One standard grammar is much simpler than N standard grammars, and I think this is a goal we can easily achieve for DASL 1.0. (Of course, I also believe that we should make allowance for fairly arbitrary extensibility, so we can do whatever we need to in the future.)

I suspect that our point of departure may be the difference in our going-in assumptions: I am very confident that the one simple-but-powerful grammar I have proposed will suffice for 1.0 and later, whereas it seems to me that you may be implicitly assuming that it will not. Since I believe this to be a critical issue, I'd like to explore it a bit, if I may.

I have designed and implemented multiple DBMS's since the dawn of the DBMS era, so I have pondered the basic issues of query in depth more than once. It is this experience that has led to my going-in assumption that the grammar I have proposed will suffice. More recently, I was the principal author of the DMA query model. DMA has dramatically demonstrated that interoperability between existing legacy document management systems can be achieved in a matter of only a few months. That is an extremely impressive accomplishment: DMA includes the functionality of all of document management, not just query, and goes well beyond the simple search functionality that is believed to be adequate for the first release of the DASL specification. I believe we would be foolish not to leverage the thought, the experience, and the lessons learned in DMA.
One of the greatest strengths of DMA is its query model, and one of the greatest strengths of the DMA query model is that it is essentially a parse tree model, just as the DASL query model is. A parse tree is a universal intermediate representation for a query. DMA has already demonstrated that the parse tree model easily and simultaneously mediates between just about any user interface query paradigm on the front end and practically any query engine interface on the back end. The user interface query paradigms that DMA has demonstrated the parse tree model can handle include (a) a rectangular matrix of pulldown boxes for operands and operators, the rows of which are AND'ed or OR'ed together to form the query condition, (b) various other styles of user interface gestures for forming queries, (c) standard query languages that are typed in directly, such as different SQL dialects, (d) proprietary query languages that are typed in directly, and so on. On the query engine side, the parse tree model has been demonstrated to interface well to various legacy API's, and to the standard and proprietary versions of the various levels of SQL used by different RDBMS's. The DMA parse tree model has anticipated object oriented query languages, e.g. OQL, as well.

Let me try to put DASL into perspective in relation to DMA. DMA is an API for the entire functionality of traditional document management, including storage, retrieval, update, versioning, containment, and query. (A DMA document space corresponds to a collection.) Above and beyond that, DMA provides self-describing metadata, advertising of functional capabilities, and an extremely extensible property-based model. DMA's query model provides a uniform view of the metadata of multiple document spaces, so that DMA clients treat the combined metadata of multiple document spaces the same as the metadata of a single document space.
An immediate implication of DMA's uniform treatment of merged metadata is that DMA can perform a query across multiple document spaces as if they were a single document space, and return the merged query results. For the multiple-document-space case, DMA uses an obvious, straightforward generalization of ANSI SQL standard three valued logic called three valued elimination. Three valued elimination is a simple set of rules for massaging the copy of the query that is passed to each individual document space so that each document space sees only legal queries. Since the parse tree form of a query is a universal intermediate representation, it mediates between the front end user interface query application and all the various back ends simultaneously. Furthermore, by using three valued elimination, query engines with different sets of available query operators can be accommodated in the same multi-repository query.

The DMA functionality is extremely large. What DASL corresponds to, in relation to the DMA functionality, is merely the representation of the intermediate parse tree form of the query that is sent across the network from the client to all the target collections. DASL doesn't even have an API component (which is going to save us a lot of work). Thus, DASL corresponds to a minuscule subset of DMA, one that DMA does not even specify. (DMA specifies the API to form the query parse tree, but allows the parse tree to be sent across the network in any format or formats, using any network protocol or protocols.)

Parse trees are conceptually very simple. There are only two things: (1) operators, and (2) operands. And there are only three types of operands: (1) the results of operators, (2) property values, and (3) literal values.
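To make the parse tree model concrete, here is a minimal sketch in Python of the structure just described: operators, and the three kinds of operands (operator results, property values, and literal values), with one possible back end that renders the tree as a SQL WHERE clause. The class and operator names are illustrative only; they are not DMA's or DASL's actual names.

```python
# Sketch of the parse tree model: operators plus three operand kinds.
# Names here are illustrative, not the actual DMA or DASL vocabulary.
from dataclasses import dataclass, field
from typing import Union, List

@dataclass
class Property:          # operand kind 2: a property value
    name: str

@dataclass
class Literal:           # operand kind 3: a literal value
    value: object

@dataclass
class Op:                # an operator; its result is operand kind 1
    name: str            # e.g. "and", "or", "eq", "gt"
    operands: List["Node"] = field(default_factory=list)

Node = Union[Property, Literal, Op]

def to_sql(node: Node) -> str:
    """One possible back end: render the parse tree as a SQL WHERE clause."""
    if isinstance(node, Property):
        return node.name
    if isinstance(node, Literal):
        v = node.value
        return repr(v) if isinstance(v, str) else str(v)
    sql = {"and": "AND", "or": "OR", "eq": "=", "gt": ">"}[node.name]
    return "(" + f" {sql} ".join(to_sql(o) for o in node.operands) + ")"

# "loan_number = 12345 AND amount > 100000" as a parse tree:
tree = Op("and", [Op("eq", [Property("loan_number"), Literal(12345)]),
                  Op("gt", [Property("amount"), Literal(100000)])])
print(to_sql(tree))   # ((loan_number = 12345) AND (amount > 100000))
```

A different back end would simply be another walk over the same tree, which is the sense in which the parse tree mediates between one front end and many query engines.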
DMA query has demonstrated that this is all you need to handle the great diversity of clients and servers discussed above, and that you can extend the model indefinitely, e.g., to content based retrieval and natural language queries.

The HTTP text for DASL will be built by programs, not by humans. I venture to assume that any commercially viable UI query program can easily convert whatever information it has gathered from the end user and generate the HTTP text. I personally believe that any programmer smart enough to understand XML can easily understand my proposed query grammar. I also venture to assume that UI query programs MUST deal with typing issues, both on input of the query and on display of the results. It is a mistake to think that clients or servers can completely escape typing issues. A query UI program has to know whether the user is typing in a number, or a string, or a datetime; it cannot treat all these input fields exactly the same way. For example, the string representation of integers doesn't collate properly unless it is expanded with leading zeroes out to some maximum field length: "1000" comes before "3". If you're typing in a date, the UI program has to check whether you're typing in nonsense like more than 31 days in a month, more than 12 months in a year, etc. Or it has to use pulldowns to prevent you from entering ill-formed dates. Similarly, on output, the query program must format the query results properly for display. Some query programs will probably even provide the option to sort or re-sort the query results displayed on the screen.

Consider one of the simplest queries I can think of: Suppose I'm processing a loan application, and I want to see all the applicable documents -- the loan application, the credit report, the termite report, etc. The query condition would be "loan_number = X".
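The collation point above ("1000" comes before "3") is easy to see directly; this small sketch shows the string sort going wrong and zero-padding restoring numeric order:

```python
# String sort vs. numeric sort: "1000" collates before "3" as a string.
nums = ["3", "1000", "27"]
print(sorted(nums))                          # ['1000', '27', '3']  -- wrong numerically
# Zero-padding to a fixed maximum width restores the numeric order:
print(sorted(n.zfill(6) for n in nums))      # ['000003', '000027', '001000']
```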
However, from real world experience I know that some lenders use integers for loan numbers, and that other lenders use strings for loan numbers (because their loan numbers incorporate letters and/or dashes). Even this very simple, basic query (which I assume you would agree belongs in the basic set that DASL 1.0 must be able to handle) requires knowledge of whether the loan number is an integer or a string. The UI program has to know on input (for example, to reject non-numeric characters if the loan number must be an integer); the UI program has to know on output so it can display the value; and the query engine has to have the HTTP DASL query translated so that X is of the proper datatype, and if it is of the wrong type, it has to give an error. Fortunately, typing is in the scope of DASL according to the DASL charter.

One of the wonderful aspects of XML is that starting and ending tags are required for everything. So, once you grok it, you can hardly conceive of any other way of expressing the query condition than as a parse tree. The fact that you should stay away from polymorphic operators is a bit less obvious, until you think about the problems you encounter in implementations. One of the strengths of my proposal is that the XML syntax forces all the types to be correct just to be well formed, except for the type of a property and certain literals. In other words, if the XML document is well formed, the types are correct, except possibly for properties and literals. And, furthermore, you know exactly what the datatype of the property or literal is supposed to be. Thus, if the XML typing proposal comes to pass, type enforcement will be complete. The XML typing proposal seems to dovetail perfectly with my proposed query grammar.

In summary, I believe that typing is inevitable. DASL has to face this issue. DASL client programs and query engines cannot avoid all typing issues.
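To illustrate what a typed XML rendering of the "loan_number = X" condition might look like, here is a sketch that builds the condition as XML with an explicit datatype on the literal. The element names (`eq`, `prop`, `literal`) and the `type` attribute are purely illustrative, not the actual DASL grammar:

```python
# Hypothetical typed-XML rendering of "loan_number = X".
# Element and attribute names are illustrative, not the DASL grammar.
import xml.etree.ElementTree as ET

def eq_condition(prop: str, value: str, datatype: str) -> ET.Element:
    eq = ET.Element("eq")
    p = ET.SubElement(eq, "prop")
    p.text = prop
    lit = ET.SubElement(eq, "literal", {"type": datatype})
    lit.text = value
    return eq

# Integer loan numbers at one lender, string loan numbers at another:
print(ET.tostring(eq_condition("loan_number", "12345", "integer"), encoding="unicode"))
print(ET.tostring(eq_condition("loan_number", "A-1234-X", "string"), encoding="unicode"))
```

The point is that the datatype travels with the query, so the server can translate X to the proper type or reject the query with an error.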
I feel confident that people will use DASL to perform queries, because it is just too inefficient to do GetProp over all the resources in a collection. I feel confident that my going-in assumption of only one query grammar is the correct one, and I'd like us to give it a fair chance. I believe we should assume that the basic grammar will be extended by defining new operators in the future, using the approach I have shown. For the first release of DASL, we can define a basic set of operators that all implementations must supply, plus as many optional, non-controversial operators as we want, e.g., addition, subtraction, multiplication, division, the SQL LIKE pattern match operator, an IS NULL operator, an IS UNDEFINED operator, etc. The AND, OR, and NOT operators, and the six relational operators for each base data type, belong in the required set, I think. I think we should have some content based retrieval as well.

- - - - - - - - - - - - - - - - - - - - - - -

However, the "contains" operator is where we step off the well beaten path, and where I suspect we will have to spend a lot of effort. The problem is that everyone understands AND, OR, and NOT and the six relational operators (and the usual arithmetic operators), but there are no standards for content based retrieval operators, for relevancy ranking, or for hit highlighting. Here are some of the issues for content based retrieval hidden behind the simple looking "contains" operator: search functionality includes substring matching (case sensitive, case insensitive, and wildcards); stemming; nearness of words; exact phrases (e.g., "The Who"); conjunctions, disjunctions, and exclusion of search conditions; exclusion of stop words; synonyms; mapping words into concepts instead of just looking up vocabulary that occurs in the document; etc. How is all this functionality specified? Which parts of it are optional?
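Since the AND, OR, and NOT operators are proposed as required, and since three valued elimination (mentioned earlier) generalizes ANSI SQL's three valued logic, it may help to spell out that logic. This sketch models UNKNOWN as Python's `None`; the function names are mine, not from any specification:

```python
# SQL-style three-valued logic: TRUE, FALSE, UNKNOWN (None stands in for UNKNOWN).
# This is the logic that three valued elimination generalizes across document spaces.
def and3(a, b):
    if a is False or b is False:
        return False              # FALSE dominates AND
    if a is None or b is None:
        return None               # otherwise UNKNOWN propagates
    return True

def or3(a, b):
    if a is True or b is True:
        return True               # TRUE dominates OR
    if a is None or b is None:
        return None
    return False

def not3(a):
    return None if a is None else (not a)

print(and3(True, None))   # None  (UNKNOWN)
print(or3(True, None))    # True
print(not3(None))         # None
```

A condition on a property that a given document space does not support evaluates to UNKNOWN there, which is why queries can still be combined sensibly across repositories with different capabilities.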
One obvious approach is to have the "contains" operator take a string parameter whose syntax specifies the search functionality. Let me give an example to convey the flavor of what I mean. However, please bear in mind that this example is in no way a proposal for a specific string syntax. For this example, assume the string parameter is "(computer$ # memory) <AND> gigabyte*". This string means the document must contain the word "computer" with stemming performed on it; stemming is indicated by the dollar sign. The "#" means that the stemmed word "computer" must occur "near" the word "memory", unstemmed. In addition to this condition (the parentheses indicate grouping), the document must also contain a word that begins with the letters "gigabyte", with the case matching exactly.

Preliminary to the search, the content based retrieval index to use must be specified. There are vendors, e.g., Oracle, whose content based search engine is integrated with their RDBMS. Mostly, however, the search engines out there search a full text index that is independent of the document management system (or the set of files) it indexes. In fact, the full text index may catalog the documents in multiple repositories, and it may catalog multiple collections of files as well. In other words, the full text index can be separate from and independent of the WebDAV collection(s) it indexes. To complicate matters even further, sometimes you want to specify more than one full text index to be used in a search.

Variants (DMA calls these alternate renditions) are another potential issue. Maybe you just submit the variant(s) you choose to the full text indexer, and the full text search simply returns the URL of the resource. Then query could retrieve whatever variant(s) you wish by whatever normal mechanism we specify for DASL queries, not necessarily only the variant(s) that was (were) indexed.
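To show that a string syntax like the example above is mechanically tractable, here is a sketch of a tokenizer for it. Again, the syntax is only the illustrative one from the example, not a proposed standard:

```python
# Tokenizer for the illustrative "contains" string syntax above.
# The syntax itself is only an example, not a proposal.
import re

# Order matters: suffixed forms (word$ for stemming, word* for prefix match)
# must be tried before the bare-word alternative.
TOKEN = re.compile(r"\(|\)|<AND>|<OR>|#|\w+\$|\w+\*|\w+")

def tokenize(s: str) -> list:
    return TOKEN.findall(s)

print(tokenize("(computer$ # memory) <AND> gigabyte*"))
# ['(', 'computer$', '#', 'memory', ')', '<AND>', 'gigabyte*']
```

A parser over these tokens would then produce exactly the kind of parse tree discussed earlier, with "contains" functionality appearing as just more operators.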
(For example, consider an image document and its OCR text. You can't put the image document in the full text index -- you put the OCR text in there. However, when you do a retrieval, my guess is that you usually want to see the image document. But then, you may not get hit highlighting on the image.)

Integrated full text retrieval engines might have no problems, other than possibly performance, in mixing full text conditions with conditions on "hard indexes", i.e., properties like "loan_number" that you put in the RDBMS catalog of your document management system. However, there may be restrictions on how conditions on hard properties can be combined with full text conditions. For example, when DMA demonstrated full text query in trial use, the demo followed the convention that the top level operator in the parse tree must be "AND", that one of its operands contained only conditions on hard properties, and that its other operand contained only full text search conditions. The DMS used was separate from the full text catalog, so it was necessary to drive the search from one or the other. In other words, you either (1) perform the query on the DMS catalog and, for every hit, submit the content to the full text condition, or (2) perform the query on the full text catalog and then, for each document number (in DASL, resource URL) returned, ask the DMS whether the hard properties satisfy the hard property condition. The catalog that returns the fewest results should be the one used to drive the search; the problem, however, is knowing which one that is. Some full text engines allow you to put hard properties in the full text index, so for these engines you might be able to just submit the whole query to the full text catalog.

There is no standard for relevancy ranking. Every engine does it its own way. Furthermore, the datatype of the rank (integer or floating point) varies.
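Strategy (2) above, driving the search from the full text catalog and then filtering by hard properties, can be sketched as follows. Both catalog functions here are toy stand-ins with hypothetical data, not a real DMS or search engine API:

```python
# Sketch of strategy (2): query the full text catalog first, then ask the
# DMS whether each returned resource satisfies the hard-property condition.
# Both catalogs below are toy stand-ins, not real APIs.
def fulltext_search(words: set) -> list:
    # Stand-in full text catalog: URL -> indexed words.
    index = {"/docs/a": {"computer", "memory"},
             "/docs/b": {"computer"},
             "/docs/c": {"memory", "gigabyte"}}
    return [url for url, ws in index.items() if words <= ws]

def hard_properties(url: str) -> dict:
    # Stand-in DMS catalog: URL -> hard properties.
    props = {"/docs/a": {"loan_number": 12345},
             "/docs/b": {"loan_number": 99999},
             "/docs/c": {"loan_number": 12345}}
    return props[url]

# Full text condition drives the search; hard-property condition filters.
hits = [u for u in fulltext_search({"computer", "memory"})
        if hard_properties(u)["loan_number"] == 12345]
print(hits)   # ['/docs/a']
```

Strategy (1) would simply invert the roles: enumerate DMS hits first, then test each document's content against the full text condition. Which direction is cheaper depends on which catalog returns fewer results.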
There are two types of relevancy ranking: (1) relevancy ranking wherein the relevancy ranks depend upon the particular documents in the corpus, and (2) relevancy ranking wherein the results do not depend upon the corpus -- the relevancy rank is relative to your query, and can be computed using only the document and the query. Relevancy ranking of type (1) does not scale to the enterprise, and does not lend itself to querying across repositories, because the relevancy ranks from different collections are not comparable. The second approach does not have this problem, but not all full text engines do it that way.

There are no standards for hit highlighting either. Again, each engine does it its own way, and some engines don't do it at all. Hit highlighting is very problematic on some types of documents. For example, for HTML, you might embed tags. For Microsoft Word documents, the format changes from release to release, and may not be documented, since it is a proprietary format. If you don't get hit highlighting information back, you have to redo the full text search on the client to perform hit highlighting. This may be sort of OK, because if you add up the CPU power in the network, most of it lives on the clients. However, you probably need a plug-in for each document format, which makes for a fat client.

Alan Babich

> -----Original Message-----
> From: Judith Slein [SMTP:slein@wrc.xerox.com]
> Sent: May 01, 1998 7:44 AM
> To: Babich, Alan
> Cc: 'Saveen Reddy (Exchange)'; 'www-webdav-dasl@w3.org'; Babich, Alan; sembower@wrc.xerox.com
> Subject: RE: Type checking in DASL
>
> It's been a goal of WebDAV and of DASL to keep clients as simple as
> possible. 1.4 of the DASL specification says "It is an objective of DASL
> to minimize the complexity of clients . . ." This is the major argument, I
> think, for keeping type checking and type conversions on the server side
> and keeping type-specific elements out of the query grammar.
> Any client
> that wants to do this work is, of course, free to; but let's not force
> clients to do it just in order to construct a query.
>
> We also provide for alternative query grammars to be carried over DASL, so
> it is possible for someone to define another grammar that uses
> type-specific comparison operators. But let's not make this part of the
> simplesearch grammar that is required to be supported for compliance with
> DASL.
>
> --Judy
>
> Name: Judith A. Slein
> E-Mail: slein@wrc.xerox.com
> Internal Phone: 8*222-5169
> Fax: (716) 422-2938
> MailStop: 105-50C
> Web Site: http://www.nde.wrc.xerox.com/users/Slein/slein.htm
Received on Monday, 4 May 1998 16:12:30 UTC