RE: Type checking in DASL

Judith:

I think it is safe to assume that we all value 
simplicity highly. Simplicity is the reason I 
made the proposal that I did. I believe it is
decidedly unsimple to start out with the
going-in assumption of producing multiple query
grammars for DASL. I believe that producing
multiple query grammars unnecessarily would be
a disservice to the industry. (The operative
word is "unnecessarily".)
One standard grammar is much
simpler than N standard grammars,
and I think this is a goal that we can easily
achieve for DASL 1.0. (Of course, I also believe 
that we should make allowance for fairly arbitrary
extensibility, so we can do whatever
we need to in the future.)

I suspect that the point where we diverge
may be the difference in our going-in assumptions:
I am very confident that the
one simple-but-powerful grammar that
I have proposed will suffice for 1.0 and later,
whereas it seems to me that you may be 
implicitly assuming that that will not be 
the case. Since I believe this to be a critical
issue, I'd like to explore it a bit, if I may.

I have designed and implemented multiple DBMS's
since the dawn of the DBMS era, so I have pondered
the basic issues of query in depth more than once.
It is this experience that has resulted in my
going-in assumption that the grammar I have
proposed will suffice.

More recently, I was the principal author 
of the DMA query model. DMA has 
dramatically demonstrated that interoperability 
between existing legacy document management systems 
can be achieved in a matter of only a few months. That 
is an extremely impressive accomplishment: DMA 
includes the functionality of all of document management,
not just query. DMA goes way beyond the simple
search functionality that is believed to
be adequate for the first release of the DASL
specification. I believe we would be foolish
not to leverage the thought, the experience, 
and the lessons learned by DMA.

One of the greatest strengths of DMA is its
query model. One of the greatest strengths of the
DMA query model is that it is essentially a
parse tree model, just as the DASL query model 
essentially is. A parse tree is a universal
intermediate representation for a query.
DMA has already demonstrated that the parse
tree model easily and simultaneously intermediates
between just about any user interface query 
paradigm on the front end, and practically any
query engine interface on the back end. 

The user interface query paradigms that the
parse tree model has been demonstrated by DMA
to handle include (a) a rectangular matrix of 
pulldown boxes for operands and operators, the rows
of which are AND'ed or OR'ed together
to form the query condition, (b) various
other styles of user interface gestures to
form queries, (c) standard query languages 
that are typed in directly such as different SQL
dialects, (d) proprietary query languages that are
typed in directly, and so on. On the query engine 
side, the parse tree model has been demonstrated 
to interface well to various legacy API's,
and to the standard and proprietary versions
of SQL used by different
RDBMS's. The DMA parse tree model has anticipated
object oriented query languages, e.g. OQL, as well.

Let me try to put DASL into perspective in relation 
to DMA. DMA is an API for the entire functionality
of traditional document management including storage,
retrieval, update, versioning, containment,
and query. (A DMA document space corresponds to a
collection.) Above and beyond that, DMA provides self
describing metadata, advertising of functional
capabilities, and an extremely extensible property
based model. DMA's query model provides a
uniform view of the metadata of multiple document 
spaces, so that DMA clients treat the combined metadata
of multiple document spaces the
same as the metadata of a single document space.

An immediate implication of DMA's uniform
treatment of merged metadata is that DMA can
perform a query across multiple document spaces
as if they were a single document
space, and return the merged query results. 
For the case of a multiple document space query, DMA
uses an obvious, straightforward generalization
of ANSI SQL standard three valued logic
called three valued elimination. Three valued
elimination is a simple set of rules to massage
the copy of the query that is passed to each
individual document space in a manner such that
each document space sees only legal queries.
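
To give the flavor of three valued elimination,
here is a rough sketch in Python. The node
representation and the specific rules are my
illustration of the idea only, not the DMA
specification:

```python
# Illustrative sketch, NOT DMA's actual rules. A parse tree node is a
# tuple: (operator, operand, ...). Predicates that use operators a
# given document space does not support collapse to UNKNOWN; three
# valued simplification then removes them, so that document space
# sees only legal queries (and returns a superset the client filters).
UNKNOWN = "unknown"

def eliminate(node, supported):
    op, *operands = node
    if op == "and":
        kids = [eliminate(k, supported) for k in operands]
        kids = [k for k in kids if k is not UNKNOWN]  # and(x, unknown) -> x
        if not kids:
            return UNKNOWN
        return ("and", *kids) if len(kids) > 1 else kids[0]
    if op == "or":
        kids = [eliminate(k, supported) for k in operands]
        if UNKNOWN in kids:   # or(x, unknown) could be true: whole branch unknown
            return UNKNOWN
        return ("or", *kids)
    if op == "not":
        kid = eliminate(operands[0], supported)
        return UNKNOWN if kid is UNKNOWN else ("not", kid)
    return node if op in supported else UNKNOWN
```

For example, a document space that lacks a
"contains" operator would receive only the hard
property half of a mixed condition.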

Since the parse tree form of a query is a
universal intermediate representation, the
parse tree format intermediates between the
front end user interface query application
and all the various different back ends
simultaneously. Furthermore, by using three
valued elimination, query engines with different
sets of available query operators can be
accommodated in the same multi-repository query.

The DMA functionality is extremely large.
What DASL corresponds to in 
relation to the DMA functionality is merely the 
representation of the intermediate
parse tree form of the query that is sent
across the network from the client to all
the target collections. DASL doesn't
even have an API component (which is going
to save us a lot of work). Thus, DASL 
corresponds to a minuscule subset of DMA
that DMA does not even specify. (DMA specifies
the API to form the query parse tree, but allows
the parse tree to be sent across the network
in any format or formats, using any network
protocol or protocols.)

Parse trees are conceptually very simple:
There are only two things: (1) operators,
and (2) operands. And, there are only three
types of operands: (1) the results of
operators, (2) property values, and (3) literal
values. DMA query has demonstrated that that is
all you need to handle the great diversity
of clients and servers discussed above,
and that you can extend the model indefinitely,
e.g., to content based retrieval and
natural language queries.
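
The whole model can be written down in a few
lines. This sketch uses Python class names I am
inventing for illustration, but it is a literal
rendering of the two things and three operand
kinds just described:

```python
from dataclasses import dataclass, field

# Class names are mine, not DMA's or DASL's.

@dataclass
class Prop:              # operand kind (2): a property value, by name
    name: str

@dataclass
class Literal:           # operand kind (3): a literal value
    value: object

@dataclass
class Op:                # an operator; its result is operand kind (1)
    name: str
    operands: list = field(default_factory=list)

# "loan_number = 12345 AND author = 'Smith'" as a parse tree:
query = Op("and", [
    Op("eq", [Prop("loan_number"), Literal(12345)]),
    Op("eq", [Prop("author"), Literal("Smith")]),
])
```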

The http text for DASL will be built by programs, 
not by humans. I venture to assume that any
commercially viable UI query program could
easily convert whatever information it
has gathered from the end user and generate 
the HTTP text. I personally believe that any 
programmer smart enough to understand XML can 
easily understand my proposed query grammar.

I also venture to assume that UI query programs 
MUST deal with typing issues both on input of the
query, and on display of the results.
It is a mistake to think that clients or servers
can completely escape from all typing issues.
A query UI program has to know whether
the user is typing in a number, or
a string, or a datetime. It cannot
treat all these input fields exactly
the same way. For example, the string
representation of integers doesn't collate
properly unless it is expanded with leading
zeroes out to some maximum field length.
For example, "1000" comes before "3".
If you're typing in a date, the UI program
has to check if you're typing in nonsense
like more than 31 days in a month, 
more than 12 months in a year, etc.
Or, it has to use pulldowns to prevent
you from entering ill formed dates.
Similarly, on output, the query program
must format the query results for display
properly. Some query programs will probably 
even provide the option to sort or resort
the query results displayed on the screen.
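
The collation point is easy to check directly:

```python
# Lexicographic (string) order disagrees with numeric order:
nums = ["3", "1000", "27"]
print(sorted(nums))           # ['1000', '27', '3']  i.e. "1000" before "3"
print(sorted(nums, key=int))  # ['3', '27', '1000']  true numeric order

# Zero-padding to a fixed maximum width makes string collation
# agree with numeric order:
padded = [n.zfill(6) for n in nums]
print(sorted(padded))         # ['000003', '000027', '001000']
```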

Consider one of the simplest queries I can 
think of: Suppose I'm processing a loan
application, and I want to see all the
applicable documents -- the loan application,
the credit report, the termite report,
etc. The query condition would be "loan_number = X".
However, from real world experience I know
that some lenders use integers for loan
numbers, and that other lenders use strings for
loan numbers (because their loan numbers incorporate
letters and/or dashes).
Even this very simple, basic
query (which I assume you would agree belongs
in the basic set that DASL 1.0 must be
able to handle) requires knowledge of whether
the loan number is an integer or a string.
The UI program has to know on input (for example,
it wants to reject non-numeric characters
if the loan number must be an integer);
it has to know on output so it can display
the value properly; and the query engine has
to have the HTTP DASL query translated
so that X is of the proper datatype, and
give an error if it is of the wrong type.
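
To illustrate (with XML element names I am
inventing on the spot, not taken from any draft
grammar), a program that knows the type of
loan_number can mechanically emit the right
literal element:

```python
import xml.etree.ElementTree as ET

def loan_query(loan_number, as_integer):
    """Build the parse tree for "loan_number = X" as XML text.
    The element names <eq>, <prop>, <int>, <string> are invented
    here for illustration. The point is only that the generating
    program must know the property's type to pick the right
    literal element."""
    eq = ET.Element("eq")
    prop = ET.SubElement(eq, "prop")
    prop.text = "loan_number"
    lit = ET.SubElement(eq, "int" if as_integer else "string")
    lit.text = str(loan_number)
    return ET.tostring(eq, encoding="unicode")

print(loan_query(12345, True))
# <eq><prop>loan_number</prop><int>12345</int></eq>
print(loan_query("A-1043", False))
# <eq><prop>loan_number</prop><string>A-1043</string></eq>
```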

Fortunately, typing is in the scope of DASL
according to the DASL charter.

One of the wonderful aspects of XML is that
starting and ending tags are required for
everything. So, once you grok it, you can
hardly conceive of any other way of
expressing the query condition than as a
parse tree. The fact that you should stay
away from polymorphic operators is a bit
less obvious, until you think about the
problems you encounter in
implementations. One of the strengths of my 
proposal is that the XML syntax forces all the
types to be correct just to be well formed,
except for the type of a property and
certain literals. In other words, if the
XML document is well formed, the types
are correct, except possibly for
properties and literals. And, furthermore,
you know exactly what the datatype of the
property or literal is supposed to be.
Thus, if the XML typing proposal
comes to pass, the type enforcement will
be complete. The XML typing proposal
seems to dovetail perfectly with my
proposed query grammar.

In summary, I believe that typing is inevitable. 
DASL has to face this issue. DASL
client programs and query engines
cannot avoid all typing issues.

I feel confident that people will use
DASL to perform queries, because it
is just too inefficient to do GetProp
over all the resources in a collection.
I feel confident that my going in assumption
of only one query grammar is the correct 
one, and I'd like us to give it a fair chance. 

I believe we should assume that the basic 
grammar will be extended using the approach 
I have shown by defining new operators 
in the future. For the first release of
DASL, we can define a basic
set of operators that all implementations
must supply, plus we can define as many
more optional, non-controversial operators as
we want to, e.g., addition, subtraction,
multiplication, division, the SQL
LIKE pattern match operator, an IS NULL
operator, an IS UNDEFINED operator, etc.

I think the AND, OR, and NOT operators, and
the six relational operators for each base
data type, belong in the required
set of operators. I think we should have
some content based retrieval as well.

- - - - - - - - - - - - - - - - - - - - - - -

However, the "contains" operator is where
we step off of the well beaten path, and
where I suspect will have to spend a
lot of effort. The problem is that
everyone understands AND, OR, and NOT
and the six relational operators (and
the usual arithmetic operators), but
there are no standards for content based
retrieval operators, or for relevancy
ranking, or for hit highlighting.

Here are some of the issues for content
based retrieval hidden behind the
simple looking "contains" operator:

Search functionality includes substring matching
(case sensitive, case insensitive,
and wildcards); stemming; nearness of
words; exact phrases (e.g., "The Who");
conjunctions, disjunctions, and exclusion of 
search conditions; exclusion of
stop words; synonyms; mapping words into
concepts instead of just looking up
vocabulary that occurs in the document; etc.
How is all this functionality
specified? Which part of this functionality
is optional?

One obvious approach to use is to have
the "contains" operator take a string
parameter. The string parameter would use a
syntax that specifies the search functionality.
Let me give an example to give the flavor of 
what I mean. However, please bear in mind that this
example is in no way a proposal for a specific
string syntax. For this example, assume the string
parameter is "(computer$ # memory) <AND> gigabyte*".
This string means the document must contain
the word "computer" with stemming 
performed on it. Stemming is indicated by the 
dollar sign. The "#" means that the stemmed
word "computer" must occur "near" the word
"memory", unstemmed. In addition to this
condition (the parentheses indicate grouping),
the document must also contain a word that
begins with the letters "gigabyte", and the
case must match exactly.
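
Such a string parameter is easy for programs to
handle mechanically. Here is a toy tokenizer for
the illustrative syntax above (again, the syntax
itself is not a proposal):

```python
import re

# Toy tokenizer for the illustrative syntax only: "$" = stem,
# "*" = prefix wildcard, "#" = near, "<AND>" = conjunction.
# This merely shows that such a string parameter is parseable.
TOKEN = re.compile(r"<AND>|<OR>|<NOT>|[()#]|\w+[$*]?")

def tokenize(s):
    return TOKEN.findall(s)

print(tokenize("(computer$ # memory) <AND> gigabyte*"))
# ['(', 'computer$', '#', 'memory', ')', '<AND>', 'gigabyte*']
```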

Preliminary to the search, the content based
retrieval index to use must be specified.
There are vendors, e.g., Oracle, that
have an integrated content based search
engine with their RDBMS. However, mostly
there are search engines out there that
search a full text index that is independent
of the document management system (or the
set of files) it indexes. In fact, the
full text index may catalog the documents
in multiple repositories, and it may catalog
multiple collections of files as well. In
other words, the full text index can be separate
from and independent of the WebDAV collection(s)
it indexes. To complicate matters even further,
sometimes, you want to specify more than one 
full text index to be used in a search.

Variants (DMA calls these alternate
renditions) are another potential issue.
Maybe you just submit the
variant(s) you choose to the full text
indexer, and the full text search simply
returns the URL of the resource. Then, the
query could retrieve whatever variant(s)
you wish by whatever normal mechanism we
specify for DASL queries, not necessarily
the variant(s) that happened to be
indexed. (For example, consider an
image document and its OCR text. You
can't put the image document in the full
text index -- you put the OCR text in
there. However, when you do a retrieval,
my guess is that you usually want to
see the image document. But then, you
may not get hit highlighting on the image.)

Integrated full text retrieval engines
might not have any problems other than,
possibly, performance, in mixing full text
conditions with conditions on "hard indexes", 
i.e., properties like "loan_number" that
you put in the RDBMS catalog of your document
management system. 

However, there may be restrictions on how 
conditions on hard properties can be combined
with full text conditions. For example,
when DMA demonstrated full text query in trial 
use, for the demo they followed the convention
that the top level operator in the parse
tree must be "AND", and that one of its operands
was only conditions on hard properties,
and its other operand was only full
text search conditions. The DMS used was
separate from the full text catalog,
so it was necessary to drive the search
from one or the other. In other words,
you either (1) perform the query on
the DMS catalog and, for every hit,
test the content against the full text
condition, or (2) perform the query
on the full text catalog and then, for
each document number (in DASL, resource
URL) returned, ask the DMS whether the
resource satisfies the hard property
condition. The catalog that returns
the fewest results should be the
one used to drive the search. However,
the problem is to know which one
it is. Some full text engines allow
you to put hard properties in the
full text index, so for these engines,
you might be able to just submit
the query to the full text catalog.
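
The two strategies, and the heuristic of driving
from the smaller result set, can be sketched as
follows (every object and method name here is
hypothetical):

```python
def mixed_query(hard_cond, text_cond, dms, fulltext):
    """Evaluate AND(hard_cond, text_cond) when the DMS catalog and
    the full text catalog are separate. Drive the search from
    whichever catalog is estimated to return fewer hits, then
    filter each hit through the other side. The dms and fulltext
    objects, and their methods, are hypothetical."""
    if dms.estimate(hard_cond) <= fulltext.estimate(text_cond):
        hits = dms.search(hard_cond)                  # strategy (1)
        return [u for u in hits if fulltext.matches(u, text_cond)]
    hits = fulltext.search(text_cond)                 # strategy (2)
    return [u for u in hits if dms.matches(u, hard_cond)]
```

As noted, the hard part in practice is getting a
trustworthy estimate of which catalog will return
fewer results.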

There is no standard for relevancy
ranking. Every engine does it its own
way. Furthermore, the datatype (integer
or floating point) varies. There are
two types of relevancy ranking: (1) Relevancy
ranking wherein the relevancy ranks
depend upon the particular documents
in the corpus. (2) Relevancy ranking
where the results do not depend upon
the corpus. The relevancy rank is
relative to your query, and can be
computed using only the document and
the query. Relevancy ranking of type
(1) does not scale to the enterprise,
and does not lend itself to querying
across repositories, because the
relevancy ranks from different collections
are not comparable. The second approach
does not have this problem, but not
all full text engines do it that way.

There are no standards for hit highlighting
either. Again, each engine does it its
own way. Some engines don't do it at
all. Hit highlighting is very problematic
on some types of documents. For example,
for HTML, you might embed tags. For Microsoft Word
documents, the format changes from release
to release, and may not be documented, since
it is a proprietary format. If you don't
get hit highlighting information back,
you have to do the full text search
again on the client to perform hit highlighting.
This may be sort of OK, because if you 
add up the CPU power in the network, most
of it lives on the client. However, you 
probably need a plug in for each document
format, which makes for a fat client.

Alan Babich


> -----Original Message-----
> From:	Judith Slein [SMTP:slein@wrc.xerox.com]
> Sent:	May 01, 1998 7:44 AM
> To:	Babich, Alan
> Cc:	'Saveen Reddy (Exchange)'; 'www-webdav-dasl@w3.org'; Babich,
> Alan; sembower@wrc.xerox.com
> Subject:	RE: Type checking in DASL
> 
> It's been a goal of WebDAV and of DASL to keep clients as simple as
> possible.  1.4 of the DASL specification says "It is an objective of
> DASL
> to minimize the complexity of clients . . ."  This is the major
> argument, I
> think, for keeping type checking and type conversions on the server
> side
> and keeping type-specific elements out of the query grammar.  Any
> client
> that wants to do this work is, of course, free to; but let's not force
> clients to do it just in order to construct a query.
> 
> We also provide for alternative query grammars to be carried over
> DASL, so
> it is possible for someone to define another grammar that uses
> type-specific comparison operators.  But let's not make this part of
> the
> simplesearch grammar that is required to be supported for compliance
> with
> DASL. 
> 
> --Judy
> 
> 
> Name:			Judith A. Slein
> E-Mail:		slein@wrc.xerox.com
> Internal Phone:  	8*222-5169
> Fax:			(716) 422-2938
> MailStop:		105-50C
> Web Site:    http://www.nde.wrc.xerox.com/users/Slein/slein.htm

Received on Monday, 4 May 1998 16:12:30 UTC