- From: Babich, Alan <ABabich@filenet.com>
- Date: Mon, 1 Jun 1998 20:47:21 -0700
- To: "'Jim Davis'" <jdavis@parc.xerox.com>, www-webdav-dasl@w3.org
- Cc: "Babich, Alan" <ABabich@felix.filenet.com>
In my previous e-mail, I discussed and invalidated C2 through C4 as arguments against client side error checking. In this e-mail, as promised, I discuss C1. C1 is:

"C1. It makes the query syntax more complicated. Compare

<where>
  <boolean_op>
    <eq>
      <string_expr>
        <string_prop>dav:getcontenttype</string_prop>
      </string_expr>
      <string_expr>
        <string_const>image/gif</string_const>
      </string_expr>
    </eq>
  </boolean_op>
</where>

with

<where>
  <eq>
    <prop>dav:getcontenttype</prop>
    <string_const>image/gif</string_const>
  </eq>
</where>"

I wouldn't use the word "complicated" to describe what happened. I would use a phrase like "undesirably verbose" or "has redundant tags". Furthermore, I wouldn't blame it on the proposal; I would blame it on the XML DTD language. The problem with the XML DTD language is that it is trying to do one too many favors for you. It does you the "favor" of automatically "inserting" tags. The other side of that coin is that you lose control of which tags have to appear in the source document for it to be valid.

However, nitpicking aside, there is still a real difference in the verbosity of the queries. My proposal amounts to taking the query you want to have, bracketing every occurrence of an operator with operator result-type tags, and bracketing every operand with expression-type tags. In the above example, the <eq> operator is enclosed in <boolean_op> brackets, and its two operands are each enclosed in <string_expr> tags. These extra tags are caused by a bit of factoring in the DTD. I would not have done that if there were a reasonable way to avoid the extra tags, but I believe there is not.

Let us see what happens if we try to eliminate the extra tags. The "where" condition isn't too bad. It becomes:

<!ELEMENT where ( and | or | not | gt | gte | eq | ne | ls | lse ) >

Every time you add a new Boolean operator, you have to extend the list of alternatives. Let's take the first operator, "and". In my proposal, it is

<!ELEMENT and ( boolean_expr , boolean_expr+ ) >

and boolean_expr is

<!ELEMENT boolean_expr ( boolean_op | boolean_prop | boolean_const ) >
<!ELEMENT boolean_prop ( #PCDATA ) >
<!ELEMENT boolean_const ( "t" | "f" | "unknown" ) >

Note that we have an N-ary AND operator, which is good, and we NEVER have to change the definition of the AND operator, no matter what new base data types, new Boolean operators, new Boolean constants, or new operators returning other types we invent now or in the future. In order to get rid of the extra tags, we have to eliminate <boolean_expr> and <boolean_op>.
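For concreteness, before eliminating anything, here is a sketch of how a compound condition might look under my proposal with the extra tags in place. (The tag nesting follows the element declarations quoted above; the second property, dav:displayname, and the constant logo.gif are chosen purely for illustration.)

<where>
  <boolean_op>
    <and>
      <boolean_expr>
        <boolean_op>
          <eq>
            <string_expr> <string_prop>dav:getcontenttype</string_prop> </string_expr>
            <string_expr> <string_const>image/gif</string_const> </string_expr>
          </eq>
        </boolean_op>
      </boolean_expr>
      <boolean_expr>
        <boolean_op>
          <eq>
            <string_expr> <string_prop>dav:displayname</string_prop> </string_expr>
            <string_expr> <string_const>logo.gif</string_const> </string_expr>
          </eq>
        </boolean_op>
      </boolean_expr>
    </and>
  </boolean_op>
</where>

Every operator is wrapped in the tag naming its result type (<boolean_op>), and every operand is wrapped in the tag naming its expression type (<boolean_expr> or <string_expr>). That is the whole of the extra verbosity, and it is exactly what a validating parser uses to reject ill-typed queries.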
Once <boolean_expr> and <boolean_op> are eliminated, the AND operator becomes

<!ELEMENT and (
  ( and , and ) | ( and , or ) | ( and , not ) | ( and , gt ) | ( and , gte ) | ( and , eq ) |
  ( and , ne ) | ( and , ls ) | ( and , lse ) | ( and , boolean_prop ) | ( and , boolean_const ) |
  ( or , and ) | ( or , or ) | ( or , not ) | ( or , gt ) | ( or , gte ) | ( or , eq ) |
  ( or , ne ) | ( or , ls ) | ( or , lse ) | ( or , boolean_prop ) | ( or , boolean_const ) |
  ( not , and ) | ( not , or ) | ( not , not ) | ( not , gt ) | ( not , gte ) | ( not , eq ) |
  ( not , ne ) | ( not , ls ) | ( not , lse ) | ( not , boolean_prop ) | ( not , boolean_const ) |
  ( gt , and ) | ( gt , or ) | ( gt , not ) | ( gt , gt ) | ( gt , gte ) | ( gt , eq ) |
  ( gt , ne ) | ( gt , ls ) | ( gt , lse ) | ( gt , boolean_prop ) | ( gt , boolean_const ) |
  ( gte , and ) | ( gte , or ) | ( gte , not ) | ( gte , gt ) | ( gte , gte ) | ( gte , eq ) |
  ( gte , ne ) | ( gte , ls ) | ( gte , lse ) | ( gte , boolean_prop ) | ( gte , boolean_const ) |
  ( eq , and ) | ( eq , or ) | ( eq , not ) | ( eq , gt ) | ( eq , gte ) | ( eq , eq ) |
  ( eq , ne ) | ( eq , ls ) | ( eq , lse ) | ( eq , boolean_prop ) | ( eq , boolean_const ) |
  ( ne , and ) | ( ne , or ) | ( ne , not ) | ( ne , gt ) | ( ne , gte ) | ( ne , eq ) |
  ( ne , ne ) | ( ne , ls ) | ( ne , lse ) | ( ne , boolean_prop ) | ( ne , boolean_const ) |
  ( ls , and ) | ( ls , or ) | ( ls , not ) | ( ls , gt ) | ( ls , gte ) | ( ls , eq ) |
  ( ls , ne ) | ( ls , ls ) | ( ls , lse ) | ( ls , boolean_prop ) | ( ls , boolean_const ) |
  ( lse , and ) | ( lse , or ) | ( lse , not ) | ( lse , gt ) | ( lse , gte ) | ( lse , eq ) |
  ( lse , ne ) | ( lse , ls ) | ( lse , lse ) | ( lse , boolean_prop ) | ( lse , boolean_const ) |
  ( boolean_prop , and ) | ( boolean_prop , or ) | ( boolean_prop , not ) | ( boolean_prop , gt ) |
  ( boolean_prop , gte ) | ( boolean_prop , eq ) | ( boolean_prop , ne ) | ( boolean_prop , ls ) |
  ( boolean_prop , lse ) | ( boolean_prop , boolean_prop ) | ( boolean_prop , boolean_const ) |
  ( boolean_const , and ) | ( boolean_const , or ) | ( boolean_const , not ) | ( boolean_const , gt ) |
  ( boolean_const , gte ) | ( boolean_const , eq ) | ( boolean_const , ne ) | ( boolean_const , ls ) |
  ( boolean_const , lse ) | ( boolean_const , boolean_prop ) | ( boolean_const , boolean_const ) |
  etc., etc., etc.

I have carried out the expansion only to 2-ary AND operators. For 3-ary AND operators, there are 11 cubed more alternatives (1,331). For 4-ary AND operators, there are 11 to the 4th power more alternatives (14,641). For 5-ary AND operators, there are 11 to the 5th power more alternatives (161,051). And so on. Clearly, at some value of N, this becomes unreasonable (in case you don't already think it's unreasonable). Even if you don't, for large enough N you would blow out the capacity of any parser, and the patience of any end user waiting while this stuff is transmitted over the network during discovery of query capabilities. So it actually is unreasonable.

Furthermore, every time you add a new operator, you increase the number that is being raised to the Nth power. For example, suppose we add IS_NULL and PROPERTY_IS_DEFINED. That's two more Boolean operators. So, instead of 11**2 + 11**3 + 11**4 + ... we have 13**2 + 13**3 + 13**4 + ... alternatives. (Here "**" indicates exponentiation.) In contrast, remember that in my proposal the AND operator is totally unaffected when new Boolean operators are added.

And that's just the AND operator. The OR operator has just as many cases. And the AND and OR operators are not even polymorphic. The six comparison operators ARE polymorphic. In order to make the grammar finite, you MUST eliminate N-ary operators.
But you are still left with a case explosion that gets worse as you add new operators, and adding new operators affects the definition of old operators. Since infinitely large grammars are not feasible, I made the proposal that I did. I avoided factoring real and integer expressions by not inventing a tag for "numeric expression". That cost 4 cases as opposed to 1 in each of the six comparison operators. I thought that was a reasonable tradeoff, since the comparison operators are binary, not N-ary, and we ended up with only 6 cases (instead of 3).

So, I conclude that the XML DTD language forces you to do a certain amount of factoring in most situations. So be it. Why worry about it? What's the big deal? Remember, the DTD is important too. It not only has to be processed by software, it has to be read and understood by mere mortals. It is also important to keep the DTD clear and compact.

So, unlike C2 through C4, C1 correctly (if you allow me to reword it properly) asserts that you end up with more tags than you wish you had; but, given that you want to define correctly constructed queries as being synonymous with syntactically valid XML documents, you have to have the extra tags. So, let us explore that definition.

There is a difference between syntactic and semantic errors. Syntactic errors can be discovered by merely looking at the query. Many semantic errors must take other knowledge (e.g., the schema of the collection and the intent of the user) into account, and can only be discovered at runtime. For example, suppose we have a multiplication operator. (I bet we will at some time in the future, but maybe not in DASL 1.0.) If you multiply two dates, then, regardless of the schema of the collection, that is an incorrectly constructed query. My proposal obviously would catch that error. It is not necessary to even let a user type in such a query, even if you know nothing about the schema of the collection. It would be valuable for the UI to prevent the user from entering such a nonsense query. My proposal would catch a host of other errors as well. For example, it ensures that the outermost operator returns a Boolean result.

I think it would be very illuminating to see what would happen if you didn't care about the types of expressions. In that case, the following simple grammar would arise:

<!ELEMENT where ( op ) >
<!ELEMENT op ( and | or | not | gt | gte | eq | ne | ls | lse ) >
<!ELEMENT expr ( op | prop | const ) >
<!ELEMENT prop ( #PCDATA ) >
<!ELEMENT const ( #PCDATA ) >
<!ELEMENT and ( expr , expr+ ) >
<!ELEMENT or ( expr , expr+ ) >
<!ELEMENT not ( expr ) >
<!ELEMENT gt ( expr , expr ) >
<!ELEMENT gte ( expr , expr ) >
<!ELEMENT eq ( expr , expr ) >
<!ELEMENT ne ( expr , expr ) >
<!ELEMENT ls ( expr , expr ) >
<!ELEMENT lse ( expr , expr ) >

Then, Jim's example would become

<where>
  <op>
    <eq>
      <expr>
        <prop>dav:getcontenttype</prop>
      </expr>
      <expr>
        <const>image/gif</const>
      </expr>
    </eq>
  </op>
</where>

Guess what: we STILL have the two extra tags, one enclosing each operator and one enclosing each operand. However, now the extra tags are completely useless. In contrast, in my proposal, the two extra tags confirmed or expressed type information. (They expressed the data type returned by operators. Among other things, this is valuable for three valued logic evaluation when your collection has no idea what the operator is. For properties and constants, they simply confirmed the type.)
If you tried to get rid of the two extra tags, you would have the infinite case explosion for all N-ary operators, e.g., AND and OR, not to mention future operators such as addition and multiplication. And you would have a finite case explosion for operators of fixed arity that increases exponentially as new operators are added.

>> So, the two extra tags are now exposed not as an artifact of type checking, but as an artifact of XML DTD syntax. <<

The above grammar allows lots of queries that are nonsensical on their face to be constructed, for example

WHERE 3.1417 AND ( "cow" > "19980406T" )

and the situation will only get worse as new operators are defined. And yet there are no compensating advantages over my proposal. So, I don't think the above grammar is useful.

I believe that any proposal for the "where" condition that is not fully recursive is a throwaway grammar. That is, it will have to be in effect obsoleted by a second, different grammar in a follow-on release of DASL. In my opinion, such a crippled grammar would not be a worthy effort, since Boolean expressions have been on the well beaten path for decades.

I don't buy the theory that my proposal is too complicated for people to understand because of the two extra tags you can't get rid of. I believe they are at worst a bit annoying. I don't believe there is any issue of people not understanding data types. I believe that every programmer knows full well the difference between integers, reals, dates, etc. My girlfriend will never program a computer, but she knows how to add numbers, and she knows dates and times perfectly well. Kids start learning small integers and the days of the week in kindergarten.

I propose we simply challenge anyone who is sufficiently annoyed by the two extra tags to try to get rid of them and yet still end up with a fully recursive grammar and N-ary operators. I would really like to see it done using the current XML DTD syntax, but I don't see how. We can't lose this challenge. Either the challenger gets stumped and has to admit that it's not possible, or the challenger gives us a grammar without the extra tags. The alternative is (1) to eliminate all N-ary operators (which will make query conditions more verbose), (2) to live with the resulting finite case explosion, and (3) to live with the facts that, when new operators are added, (a) the definitions of old operators are modified, and (b) the case explosion is aggravated exponentially.

---

Let me wrap this e-mail up with the observation that my proposal isn't really about type checking -- it's about correctly constructed queries. Any type checking that happens as a side effect of checking validity simply guarantees that the types in the query are proper relative to the query, and this is independent of any schema on any server. Some people are confusing the concept of correctly constructed queries with run time type checking on the server. I have been proposing that, by definition, a query is constructed correctly if and only if the XML document is valid. Since property names are just parseable character data, they may or may not happen to match a property on any resource, or on any particular resource, on the server. That's OK, and it has nothing whatsoever to do with the query being constructed correctly by my definition.

As for literal constants, I propose we simply put words in the spec. to indicate the proper syntax for dates (steal this from WebDAV), integers, reals, etc. This is the proper way to approach these things.
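As a sketch of the sort of wording I have in mind (the element names other than string_const, and the exact formats, are merely illustrative, not settled text): an integer constant would be an optional sign followed by decimal digits, a real would add a decimal point and an optional exponent, and a date would use the ISO 8601 date-time format that WebDAV uses for its creationdate property. In a query, such constants might look like:

<string_const>image/gif</string_const>
<integer_const>42</integer_const>
<real_const>-3.25E2</real_const>
<date_const>1998-06-01T20:47:21Z</date_const>

The point is that the lexical form of each constant is specified in the prose of the spec., not in the DTD.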
I read that somewhere about XML, i.e., that it is unreasonable to try to completely specify the syntax of literal constants in XML DTD syntax.

DASL is forced to design for the case that people are guessing the property names, and that the resources in the collection are heterogeneous, i.e., that the set of defined properties is not the same for all resources. That is because there is nothing in WebDAV or DASL that prohibits these two things from happening. So, we use ANSI standard three valued logic. If a property is not defined on a particular resource currently under scan, we treat it as if it were defined but its value were null. And if the query has the concept that the property is supposed to be of a certain type, but it is in fact of a different type on the current resource under scan, then we also treat the property value as if it were null for purposes of evaluating the "where" condition. That is, we use ANSI standard three valued logic, and if the "where" condition evaluates to TRUE, we include the resource in the answer set. If it evaluates to FALSE or UNKNOWN, we do not. (For example, if dav:getcontentlength is not defined on the resource under scan, a comparison against it evaluates to UNKNOWN, and the resource is excluded unless some other part of the condition makes the whole expression TRUE.) (And for null or undefined properties that are explicitly selected, we return an embedded error a la WebDAV in the result set, as mentioned previously. For "order by", null collates before any actual value, as per the ANSI standard.)

---
Alan Babich

> -----Original Message-----
> From: Jim Davis [SMTP:jdavis@parc.xerox.com]
> Sent: May 29, 1998 3:26 PM
> To: www-webdav-dasl@w3.org
> Subject: design issue: query type checking via DTD
>
> This message attempts to summarize a design issue that has been
> discussed at some length in this forum, but not yet resolved. This
> message summarizes the design issue and the arguments that have thus
> far been advanced for and against it.
>
> Appropriate responses are:
> * The design issue is not correctly stated, but should instead be ...
> * An Additional argument (pro or con) is ...
> * Argument X is misstated, it should be ...
>
> Issue: should the DASL DTD for query be defined such that one can do
> type-checking on a query simply by validating the query against the
> DTD?
>
> Context: This issue is, in some sense, a followon to an earlier
> issue summarized in "Design issue: polymorphism". That issue was
> whether DASL operators should be polymorphic. There, the chief
> argument against such polymorphism was that it prevented type
> checking. We now have a proposed DTD (see Alan Babich's email of
> Sun, 10 May 1998 18:45:18 PDT) that is polymorphic while still
> allowing type checking, so that particular design issue is moot.
> While some have objected to the complexity of that DTD, its claimed
> advantage is that it supports type checking. This raises the
> logically prior issue of whether a DTD *should* support type
> checking. If we agree that it should, we can then debate specific
> proposed DTDs that do that. If we agree that it needn't (or
> shouldn't) then we should consider different DTDs.
>
> Arguments are labelled P for Pro and C for Con.
>
> P1. It makes error checking easy, because it suffices to use a
> validating XML parser, which is easily obtained.
>
> P2. It enables clients to do client side checking, which is good
> because you get a better UI by detecting errors as early as possible.
> Advanced UIs are likely to require strong typing anyway, when
> prompting for values and accepting input (e.g. to accept only
> correctly formatted dates or integers.)
>
> [This is really an argument for strongly typed queries in general,
> not for the specific approach of using a DTD for such strong
> typing.]
>
> P3. It relies only on core XML features, as opposed to extensions or
> conventions such as XML-Data that are still being designed.
>
> C1. It makes the query syntax more complicated. Compare
>
> <where>
> <boolean_op>
> <eq>
> <string_expr>
> <string_prop>dav:getcontenttype</string_prop>
> </string_expr>
> <string_expr>
> <string_const>image/gif</string_const>
> </string_expr>
> </eq>
> </boolean_op>
> </where>
>
> with
>
> <where>
> <eq>
> <prop>dav:getcontenttype</prop>
> <string_const>image/gif</string_const>
> </eq>
> </where>
>
> C2. Servers can't rely on clients to do the syntax checking anyway, as
> some clients won't bother, and others may be either buggy or malicious.
>
> C3. A DTD may not be sufficient for type-checking, because WebDAV
> properties are not typed.
>
> C4. A DTD is not sufficient for all error checking, since for example
> a property might not be defined on a given resource.