RE: design issue: query type checking via DTD from Babich, Alan on 1998-06-02 (www-webdav-dasl@w3.org from April to June 1998)

From: Babich, Alan <ABabich@filenet.com>
Date: Mon, 1 Jun 1998 20:47:21 -0700
To: "'Jim Davis'" <jdavis@parc.xerox.com>, www-webdav-dasl@w3.org
Cc: "Babich, Alan" <ABabich@felix.filenet.com>
Message-ID: <72B1992276A9D111A20E00805FEAC96DCC40B3@cm-expo1.filenet.com>
In my previous e-mail, I discussed and
invalidated C2 through C4 as arguments
against client side error checking.
In this e-mail, as promised in my previous
e-mail, I discuss C1. C1 is:

"C1. It makes the query syntax more complicated.  Compare

<where>
  <boolean_op>
    <eq>
      <string_expr>
        <string_prop>dav:getcontenttype</string_prop>
      <string_expr>
      <string_expr>
        <string_const>image/gif</string_const>
      </string_expr>
    </eq>
  </boolean_op>
</where>

with 

<where>
    <eq>
      <prop>dav:getcontenttype</prop>
      <string_const>image/gif</string_const>
    </eq>
</where>"


I wouldn't use the word "complicated" to discuss
what happened. I would use some word like
"undesirably verbose" or "has redundant tags".

Furthermore, I wouldn't blame it on the proposal.
I would blame it on the XML DTD language.
The problem with the XML DTD language is that
it is trying to do one too many favors for you.
It does you the "favor" of automatically
"inserting" tags. The other side of that
coin is that you loose control of which
tags have to appear in the source document
to be valid.

However, nit picking aside, there is 
still a real difference in verbosity of the 
queries. My proposal results in your
taking the query you want to have, and then
bracketing every occurrence of an operator 
with operator result-type tags, and also
bracketing every operand with expression
type tags.

In the above example, the <eq> operator
is enclosed in <boolean_op> brackets,
and its two operands are each enclosed in
<string_expr> tags.

These extra tags are caused by a bit of
factoring in the DTD. I would not have
done that if there were a reasonable way
to avoid these extra tags. However, I 
believe there is not. Let us see what 
happens if we try to eliminate the extra 
tags.


The "where" condition isn't too bad.
It becomes:

<!ELEMENT where ( and | or | not | 
                  gt | gte | eq | ne | ls | lse ) >

Every time you add a new Boolean operator,
you have to extend the list of alternatives.

Let's take the first operator, "and".

In my proposal, it is

<!ELEMENT and ( boolean_expr , boolean_expr+ ) >

and boolean_expr is

<!ELEMENT boolean_expr ( boolean_op | boolean_prop | boolean_const ) >
<!ELEMENT boolean_prop ( #PCDATA ) >
<!ELEMENT boolean_const ( "t" | "f" | "unknown" ) >

Note that we have an N-ary AND operator,
which is good, and we NEVER have to change
the definition of the AND operator, no
matter what new base data types, new Boolean
operators, new Boolean constants, or new
operators returning other types we invent
now or in the future.

In order to get rid of the extra tags, we
have to eliminate <boolean_expr> and <boolean_op>.
Then, the AND operator becomes

<!ELEMENT and (
    ( and , and ) | 
    ( and , or ) | 
    ( and , not ) | 
    ( and , gt ) | 
    ( and , gte ) | 
    ( and , eq ) | 
    ( and , ne ) | 
    ( and , ls ) | 
    ( and , lse ) |
    ( and , boolean_prop ) |
    ( and , boolean_const ) |
    ( or , and ) |
    ( or , or ) | 
    ( or , not ) | 
    ( or , gt ) | 
    ( or , gte ) | 
    ( or , eq ) | 
    ( or , ne  ) | 
    ( or , ls ) | 
    ( or , lse ) |
    ( or , boolean_prop ) |
    ( or , boolean_const ) |
    ( gt , and ) |
    ( gt , or ) | 
    ( gt , not ) | 
    ( gt , gt ) | 
    ( gt , gte ) | 
    ( gt , eq ) | 
    ( gt , ne  ) | 
    ( gt , ls ) | 
    ( gt , lse ) |
    ( gt , boolean_prop ) |
    ( gt , boolean_const ) |
    ( gte , and ) |
    ( gte , or ) | 
    ( gte , not ) | 
    ( gte , gt ) | 
    ( gte , gte ) | 
    ( gte , eq ) | 
    ( gte , ne  ) | 
    ( gte , ls ) | 
    ( gte , lse ) |
    ( gte , boolean_prop ) |
    ( gte , boolean_const ) |
    ( eq , and ) |
    ( eq , or ) | 
    ( eq , not ) | 
    ( eq , gt ) | 
    ( eq , gte ) | 
    ( eq , eq ) | 
    ( eq , ne  ) | 
    ( eq , ls ) | 
    ( eq , lse ) |
    ( eq , boolean_prop ) |
    ( eq , boolean_const ) |
    ( ne , and ) |
    ( ne , or ) | 
    ( ne , not ) | 
    ( ne , gt ) | 
    ( ne , gte ) | 
    ( ne , eq ) | 
    ( ne , ne  ) | 
    ( ne , ls ) | 
    ( ne , lse ) |
    ( ne , boolean_prop ) |
    ( ne , boolean_const ) |
    ( ls , and ) |
    ( ls , or ) | 
    ( ls , not ) | 
    ( ls , gt ) | 
    ( ls , gte ) | 
    ( ls , eq ) | 
    ( ls , ne  ) | 
    ( ls , ls ) | 
    ( ls , lse ) |
    ( ls , boolean_prop ) |
    ( ls , boolean_const ) |
    ( lse , and ) |
    ( lse , or ) | 
    ( lse , not ) | 
    ( lse , gt ) | 
    ( lse , gte ) | 
    ( lse , eq ) | 
    ( lse , ne  ) | 
    ( lse , ls ) | 
    ( lse , lse ) |
    ( lse , boolean_prop ) |
    ( lse , boolean_const ) |
    ( boolean_prop , and ) |
    ( boolean_prop , or ) | 
    ( boolean_prop , not ) | 
    ( boolean_prop , gt ) | 
    ( boolean_prop , gte ) | 
    ( boolean_prop , eq ) | 
    ( boolean_prop , ne  ) | 
    ( boolean_prop , ls ) | 
    ( boolean_prop , lse ) |
    ( boolean_prop , boolean_prop ) |
    ( boolean_prop , boolean_const ) |
    ( boolean_const , and ) |
    ( boolean_const , or ) | 
    ( boolean_const , not ) | 
    ( boolean_const , gt ) | 
    ( boolean_const , gte ) | 
    ( boolean_const , eq ) | 
    ( boolean_const , ne  ) | 
    ( boolean_const , ls ) | 
    ( boolean_const , lse ) |
    ( boolean_const , boolean_prop ) |
    ( boolean_const , boolean_const ) |

etc., etc., etc.

I have carried out the expansion to only
2-ary AND operators. For 3-ary AND operators,
there are 11 cubed more alternatives (1,331).
For 4-ary AND operators, there are 11 to the
4th power more alternatives (14,641).
For 5-ary AND operators, there are 11 to the
5th power more alternatives (161,051).
Etc. 

Clearly, at some value of N, this
becomes unreasonable (in case you don't already
think it's unreasonable). Even if you don't think
it's unreasonable, for large enough N, you
would blow out the capacity of any parser,
and the patience of any end user waiting
while this stuff is transmitted over the
network during discovery of query capabilities.
So, it actually is unreasonable.

Furthermore, every time you add a new
operator, you increase the number that is being
raised to the Nth power. For example,
suppose we add IS_NUL and PROPERTY_IS_DEFINED.
That's two more Boolean operators.
So, instead of 11**2 + 11**3 + 11**4 + ...
we have 13**2 + 13**3 + 13**4 + ... alternatives.
(Here "**" indicates exponentiation.)
In contrast, remember, that in my proposal, 
the AND operator is totally unaffected when 
new Boolean operators are added.

And that's just the AND operator. The OR
operator has just as many cases. And the AND
and OR operators are not even polymorhpic.
The six comparison operators ARE polymorhpic.

In order to make the grammar finite, you
MUST eliminate N-ary operators. But
you are still left with a case explosion
that gets worse as you add new operators.
And adding new operators affects the
definition of old operators.

Since infinitely large grammars are not 
feasible, I made the proposal that I did. 
I avoided factoring real and integer expressions 
by not inventing a tag for "numeric expression".
That cost 4 cases as opposed to 1 in all of
the six comparison operators. I thought that
was a reasonable tradeoff, since the
comparison operators are binary, not N-ary,
and we ended up with only 6 cases (instead
of 3).

So, I conclude that the XML language forces you 
to do a certain amount of factoring in most
situations. So be it. Why worry about it? 
What's the big deal? Remember, the DTD is
important too. It not only has to be processed
by software, it has to be read and understood by 
mere mortals. It is also important to keep the DTD
clear and compact.

So, unlike C2 through C4, C1 correctly (if you
allow me to reword it properly) asserts that 
you end up with more tags than you wish you
had, but, given that you want to define correctly
constructed queries as being synonymous with
syntactically valid XML documents, you have
to have the extra tags.

So, let us explore that definition.

There is a difference between syntactic and
semantic errors. Syntactic errors can be
discovered by merely looking at the query.
Many semantic errors must take other knowledge
(e.g., the schema of the collection and
the intent of the user) into account, and
can only be discovered at runtime.

For example, suppose we have a multiplication
operator. (I bet we will at some time in the
future, but maybe not in DASL 1.0.) Then,
if you multiply two dates, then, regardless
of the schema of the collection, that is
an incorrectly constructed query. My proposal
obviously would catch that error. It is not 
necessary to even let a user type in such a query,
even if you know nothing about the schema
of the collection. It would be valuable for
the UI to prevent the user from entering
such a nonsense query. My proposal would catch
a host of other errors as well. For example,
my proposal insures that the outermost
operator returns a Boolean result.

I think it would be very illuminating to
see what would happen if you didn't care
about the types of expressions. In that case, 
the following simple grammar would arise:

<!ELEMENT where ( op ) >

<!ELEMENT op ( and | or | not | 
               gt | gte | eq | ne | ls | lse ) >

<!ELEMENT expr ( op | prop | const ) >
<!ELEMENT prop ( #PCDATA ) >
<!ELEMENT const ( #PCDATA ) >

<!ELEMENT and ( expr , expr+ ) >
<!ELEMENT or ( expr , expr+ ) >
<!ELEMENT not ( expr ) >
<!ELEMENT gt ( expr , expr ) >
<!ELEMENT gte ( expr , expr ) >
<!ELEMENT eq ( expr , expr ) >
<!ELEMENT ls ( expr , expr ) >
<!ELEMENT lse ( expr , expr ) >

Then, Jim's example would become

<where>
  <op>
    <eq>
      <expr>
        <prop>dav:getcontenttype</prop>
      <expr>
      <expr>
        <const>image/gif</const>
      </expr>
    </eq>
  </op>
</where>

Guess what. We STILL have the two extra tags,
one enclosing each operator, and one enclosing
each operand. However, now the extra tags are
completely useless. In contrast, in my proposal, 
the two extra tags confirmed or expressed 
type information. (They expressed the data type
returned by operators. Among other things, this
is valuable for three valued elimination when your 
collection has no idea what the operator is.
For properties and constants, they simply 
confirmed the type.)

If you tried to get rid of the two extra
tags, you would have the infinite case explosion
for all N-ary operators, e.g., AND, and OR,
not to mention future operators e.g. addition
and multiplication. And you would have a
finite case explosion for operators of
fixed arity that increases exponentially
as new operators are added.

>> So, the two extra tags are now exposed not as an
artifact of type checking, but as an artifact
of XML DTD syntax. <<

The above grammar allows lots of queries that
are nonsensical on the face of it to be
constructed. For example, 

    WHERE 3.1417 AND ( "cow" > "19980406T" )

and the situation will only get worse as
new operators are defined. And, yet, there 
are no compensating advantages over my proposal.
So, I don't think the above grammar is useful.

I believe that any proposal for the "where" 
condition that is not fully recursive is a 
throwaway grammar. That is, it will have to 
be in effect obsoleted by a second, different 
grammar in a follow on release of DASL. In my 
opinion, such a crippled grammar would not be a 
worthy effort, since Boolean expressions have 
been on the well beaten path for decades.

I don't buy the theory that my proposal is 
too complicated for people to understand because 
of the two extra tags you can't get rid of.
I believe they are at worst a bit annoying.

I don't believe there is any issue of people 
not understanding data types. I believe that
every programmer knows full well the difference
between integers, reals, dates, etc.
My girlfriend will never program a computer,
but she knows how to add numbers, and she 
knows dates and times perfectly well. Kids
start learning small integers and the 
days of the week in kindergarden.

I propose we simply challenge anyone who is 
sufficiently annoyed by the two extra tags
to try to get rid of them, and yet still end 
up with a fully recursive grammar and N-ary
operators. I would really like to see it 
done using the current XML DTD syntax, 
but I don't see how. We can't lose this 
challenge. Either the challenger gets 
stumped and has to admit that it's not 
possible, or the challenger gives us
a grammar without the extra tags.

The alternative is to (1) eliminate
all N-ary operators (which will make query
conditions more verbose), (2) to live with
the resulting finite case explosion, and 
(3) to live with the facts that, when new 
operators are added, (a) the definitions of 
old operators are modified, and 
(b) the case explosion is aggravated 
exponentially.

                   ---

Let me wrap this e-mail up with the
observation that my proposal isn't really
about type checking -- it's about correctly
constructed queries. Any type checking that
happens as a side effect of checking validity
simply guarantees that the types in the 
query are proper relative to the query, and
this is independent of any schema on any server.

Some people are confusing the concept of 
correctly constructed queries and run time 
type checking on the server. I have been
proposing that, by definition, a query is 
constructed correctly if and only if the XML 
document is valid. 

Since property names are just parseable 
character data, they may or may not happen 
to match a property on any resource or 
any particular resource on the server.
That's OK, and it has nothing to do
whatsoever with the query being constructed
correctly by my definition.

As far as literal constants,
I propose we simply put words in the
spec. to indicate the proper syntax
for dates (steal this from WebDAV),
integers, and reals, etc. This is
the proper way to approach these things.
I read that somewhere about XML, i.e., that 
it is unreasonable to try to completely 
specify the syntax of literal constants 
in XML DTD syntax.

DASL is forced to design for the case that
that people are guessing the property names,
and that the resources in the collection
are heterogeneous, i.e., that the set of defined
properties is not the same for all resources.
That is because there is nothing in WebDAV
or DASL that prohibits these two things
from happening.

So, we use ANSI standard three valued logic.
If a property is not defined on a particular
resource currently under scan, we treat it
as if it were defined, but its value were
null. And id the query has the concept that
the property is supposed to be of a certain
type, but it is in fact of a different type
on the current resource under scan, then
we also treat the property value as if
it were null for purposes of evaluating the
"where" condition. That is, we use ANSI
standard three valued logic, and if the 
"where" condition evaluates to TRUE,
we include the resource in the answer set.
If it evaluates to FALSE or UNKNOWN, we
do not. (And for null or undefined properties
that are explicitly selected, we return
an embedded error a la WebDAV in the result set,
as mentioned previously. For "order by",
null collates before any actual value,
as per ANSI standard.)

                             ---
Alan Babich


> -----Original Message-----
> From:	Jim Davis [SMTP:jdavis@parc.xerox.com]
> Sent:	May 29, 1998 3:26 PM
> To:	www-webdav-dasl@w3.org
> Subject:	design issue: query type checking via DTD
> 
> This message attempts to summarize a design issue that has been
> discussed at some length in this forum, but not yet resolved.  This
> message summarizes the design issue and the arguments that have thus
> far been advanced for and against it.
> 
> Appropriate responses are:
>  * The design issue is not correctly stated, but should instead be ...
>  * An Additional argument (pro or con) is ...
>  * Argument X is misstated, it should be ...
> 
> Issue: should the DASL DTD for query be defined such that one can do
> type-checking on a query simply by validating the query against the
> DTD?
> 
>   Context: This issue is, in some sense, a followon to an earlier
>   issue summarized in "Design issue: polymorphism".  That issue was
>   whether DASL operators should be polymorphic.  There, the chief
>   argument against such polymorphism was that it prevented type
>   checking.  We now have a proposed DTD (see Alan Babich's email of
>   Sun, 10 May 1998 18:45:18 PDT) that is polymorphic while still
>   allowing type checking, so that particular design issue is moot.
>   While some have objected to the complexity of that DTD, its claimed
>   advantage is that it supports type checking.  This raises the
>   logically prior issue of whether a DTD *should* support type
>   checking.  If we agree that it should, we can then debate specific
>   proposed DTDs that do that.  If we agree that it needn't (or
>   shouldn't) then we should consider different DTDs.
> 
> Arguments are labelled P for Pro and C for Con.
> 
> P1. It makes error checking easy, because it suffices to use a
> validating XML parsers, which is easily obtained.
> 
> P2. It enables clients to do client side checking, which is good
> because you get a better UI by detecting errors as early as possible.
> Advanced UIs are likely to require strong typing anyway, when
> prompting for values and accepting input (e.g. to accept only
> correctly formatted dates or integers.)
> 
>   [This is really an argument for strongly typed queries in general,
>    not for the specific approach of using a DTD for such strong
>    typing.]
> 
> P3. It relies only on core XML features, as opposed to extensions or
> conventions such as XML-Data that are still being designed.
> 
> C1. It makes the query syntax more complicated.  Compare
> 
> <where>
>   <boolean_op>
>     <eq>
>       <string_expr>
>         <string_prop>dav:getcontenttype</string_prop>
>       <string_expr>
>       <string_expr>
>         <string_const>image/gif</string_const>
>       </string_expr>
>     </eq>
>   </boolean_op>
> </where>
> 
> with 
> 
> <where>
>     <eq>
>       <prop>dav:getcontenttype</prop>
>       <string_const>image/gif</string_const>
>     </eq>
> </where>
> 
> 
> C2. Servers can't rely on clients to do the syntax checking anyway, as
> some clients won't bother, and others may be either buggy or malicious
> 
> C3. A DTD may not be sufficient for type-checking, because WebDAV
> properties
> are not typed.
> 
> C4. A DTD is not sufficient for all error checking, since for example
> a property might not be defined on a given resource.
>
Received on Monday, 1 June 1998 23:49:42 UTC