RE: Type checking in DASL from Babich, Alan on 1998-04-30 (www-webdav-dasl@w3.org from April to June 1998)

From: Babich, Alan <ABabich@filenet.com>
Date: Wed, 29 Apr 1998 17:23:29 -0700
To: "'Saveen Reddy (Exchange)'" <saveenr@Exchange.Microsoft.com>, "'www-webdav-dasl@w3.org'" <www-webdav-dasl@w3.org>
Cc: "Babich, Alan" <ABabich@felix.filenet.com>
Message-ID: <72B1992276A9D111A20E00805FEAC96D0FD65A@cm-expo1.filenet.com>
Saveen:

Regarding your comment "ITEM 4 - I don't see
the benefit of having n "greater than"
operators ...", I would like to look a little 
more deeply into the problem, if I may.

A query engine will execute the query on a
server. Let us consider some of the detailed
ramifications of that.

(1) Let us consider, for the moment, only comparison
between like types, i.e., integer > integer,
string > string, datetime > datetime, etc.

Even for this case of homogeneous types,
the comparison routines are completely different.
The comparison routines for two integers have
nothing in common with the comparison routines
for two strings. The string comparison routines
are much more complicated, because for strings
there are character set translation issues,
collating sequence issues that arise from
localization, and case sensitivity or
insensitivity issues. So, the query engine
is forced to discriminate which case pertains,
so that it can call the correct comparison
routine.

For polymorphic "greater than", the query 
engine must decide which comparison
routine to call based on the type of the
operands. This is more complicated than
necessary: Obviously, the query engine has
to have the equivalent of a C "switch"
statement on the operator regardless of
whether the operators are polymorphic or not.
The point is that it could get the determination
of which comparison routine to call "for free"
from the switch statement on the operator if
the operators were NOT polymorphic. 

If the "greater than" operator were polymorphic,
the query engine would case on the operator
and end up in the case for "greater than",
and then it would have to have a second switch
statement on the operand types once it got
there. The implementation cost
and the run time cost of the extra switch
statement is undesirable. There is
no development or run time cost to making
the "greater than" operator non polymorphic
in the spec. There are only a few more lines
of element declarations.


(2) Now let us consider the case of mixed
types.

Heterogeneous types could simply be disallowed out
of hand as a syntax error, but let us take the 
polymorhpic approach further toward its logical 
conclusion, since were are looking more deeply
into the polymorphic operator issue.

It turns out that comparisons of mixed types
can sometimes make sense, even good sense.

Obviously { 3 > 4.01 } can be considered a
valid comparison. (The two different types
involved here are integer and floating point.)

Let us consider another example: Consider
{ integer > string }. 

It turns out that comparisons with these 
types (and comparisons
of type { string > integer}) can make
sense, depending upon the value of the
string. Suppose the string value is "cow",
and the integer value is 3. Then, almost
everyone would agree that { 3 > "cow" }
should be considered an illegal comparison.

However, suppose the value of the integer 
is 3, and the value of the string is
"  3,000,000.01 E -04 ". One could conceivably
take the position that the comparison should be
allowed, because by following some
standard floating point notational conventions,
the string could be converted into a floating point
number, and the integer could be promoted to a
floating point number, and comparing the two values
makes sense.  In Europe, however,
the role of comma and period are interchanged,
so the string is not in a valid numerical format,
and the comparison would be illegal.

Note that the integer could be a property, or it 
could be a literal. Also, the string could be a property
or it could be a literal. Any of the four possible
combinations of property and literal could possibly
be legal or illegal, depending on the VALUE of the string
property or literal, not on its being of type string.

I know of existing systems that disallow the comparison
of numbers to strings, and I know of
systems that allow the comparison of numbers to strings
if the string happens to be convertible
into a number. One approach has a speed and simplicity
advantage, and the other could be argued to have a
functionality advantage.

(3) Now let us consider error handling. Most people
believe that catching errors as soon as possible
is desirable. However, since we are looking more
deeply into the problem, let us consider the error
handling behavior in more detail.

Let us consider the case of a system
where a client PC produces a DASL query, 
sends it to a web server, which
sends it to a layer of software that converts
the DASL query to calls on the DMA (Document Management
Alliance) API, and the DMA document management
system stores its document catalog information in
a relational database.

Suppose further that we had polymorphic "greater
than". What is the DASL-to-DMA API layer
supposed to do with the { integer > string } case?
The DMA technical committee discussed the 
polymorphic query operator issue at great length, 
and decided against polymorphic operators, so
the DMA API is supposed to reject such a query. Therefore,
the implementers of the DASL-to-DMA layer would
have to decide whether it could convert the
comparison to a { integer > integer } comparison,
or a { floatingpoint > floatingpoint } comparison
before passing the query on to the DMA API.
This would only be legally possible if the string 
involved were a literal, not a property. So, sometimes
that kind of comparison works, and sometimes it
doesn't. When it doesn't, an error must be
returned. This error occurred on the server.
The DASL-to-DMA layer could have an error and
pass a malformed query to the DMA API, and the
DMA API could have an error and accept it. Then,
when the query was converted to SQL and passed
to the RDBMS, a SQL syntax error would occur.
It would be difficult to pass a clear simple
error back to the client that made the problem
clear.

Now consider the case where there is no DMA
API layer, so no conversion to it. The DASL
query would be passed on to some query engine.
Suppose the query engine was based on a
relational database engine, and it converted
the DASL query to SQL and called the RDBMS.
Well, depending upon the RDBMS, I can tell
you that the query may succeed or fail,
depending upon which RDBMS is used. If the
query fails, it would be difficult to pass
an error back to the client that would make
it clear what was going on.

Thus, when errors occur on the server,
there is always greatly increased difficulty
of making it clear to (a) the human end user,
(b) the support people he calls when things
don't work, and (c) the programmers who get
involved in the bug fixing, whether or not
the bug is in their code or some other
company's code as to what exactly went wrong.

Furthermore, the end user has to wait longer
to even get an error.

In contrast, if the DASL query is rejected
on the client PC as being malformed, it is
clear to everyone that the UI application
that the user is running has a bug.

Thus, it is clearly better to have the
non polymorphic operators I proposed, since
all of the expensive and time consuming
problems that arise when polymorphic
operators are used are precluded, and
the query engine implementations
are simpler and faster.

(4) The non polymorphic operators are
much easier to describe in XML. For
the polymorphic operators, how are you
going to express what combinations of
types are legal, and which are illegal?
You don't just want to have words in
the spec. You want the element definitions
to embody the correctness or incorrectness
of the construction of the query.
This is an industry standard. We need to be 
precise as to what is allow, and not leave holes.

(5) DMA has drilled down on this particular
rathole extensively, and, I believe, come
up with the correct conclusion -- don't
have polymorphic operators. There are only
advantages, and no real disadvantages. (I don't
consider a few more lines in the DASL
spec. for element declarations to be a
disadvantage.) I propose we leverage the
DMA work instead of going down this rathole
again with a different group of people, 
make DASL and DMA more interoperable,
simplify the implementation of the query
engines, make the query engines faster, by
having the non polymorphic operators I
proposed.

(6) I would like to not how nicely the
proposed typing extensions to XML would
dovetail with my proposal for the query
operators. Let us use the 

    <elementType id="author">
        <string/>
    </elementType>

example. My proposal for the element definitions
recurses down to properties and literals, and 
when it gets to this bottom level of the recursion, 
it in effect says "a string valued property is 
required here" or "a string literal is required here". 
With the XML extensions for typing, you could verify
syntactically whether or not all the types
match on the client side. Thus, my
proposal dovetails perfectly.

Alan Babich

> -----Original Message-----
> From:	Saveen Reddy (Exchange) [SMTP:saveenr@Exchange.Microsoft.com]
> Sent:	April 28, 1998 10:30 AM
> To:	'www-webdav-dasl@w3.org'
> Subject:	RE: Type checking in DASL
> 
> 
> First, I just wanted to clarify that XML-DATA does type elements as
> Alan's
> example shows ...
> 
>     <elementType id="author">
>         <string/>
>     </elementType>
> 
> But this is only one part of XML-DATA .. the part that provides an
> XML-based
> "DTD", and that associates a type with all tags named "author".
> XML-DATA can
> also bind a type to an element on-the-fly ...
> 
> 	<author dt:dt="string">Clark Kent</author>
> 	<author dt:dt="int">235</author>
> 
> ITEMs 1,2,3 -- I agree with these
> 
> ITEM 4 - I don't see the benefite of making n "greater-than" operators
> to
> correspond with the n types as opposed to one operator that can deal
> with n
> types. I think the latter is the natural way to do it. Just to be
> clear, I
> don't claim the either is more expressive, but that the latter is
> easier to
> explain and use.
> 
> Thanks,
> Saveen
> 
> 
> -----Original Message-----
> From: Babich, Alan [mailto:ABabich@filenet.com]
> Sent: Friday, April 17, 1998 5:40 PM
> To: Alex Hopmann (Exchange); 'Reddy, Saveen'; 'Davis, Jim'; 'Reddy,
> Surendra'; 'Henderson, Rick'; 'Jensen, Del'
> Cc: Babich, Alan; 'www-webdav-dasl@w3.org'
> Subject: Type checking in DASL
> 
> 
> This memo describes one approach to introduce 
> syntactic type checking into DASL without relying
> on XML extensions. Saveen has already pointed
> out that there is a proposed extension
> to XML for typing. It would extend the XML 
> language by having new constructs such as 
> 
>     <elementType id="author">
>         <string/>
>     </elementType>
> 
> to define an "author" conceptual type, the value 
> of which is expressed as a string (i.e., the
> datatype of "author" is "string").
> 
> In this memo, I just focus on the query condition.
> 
> This proposal attempts to just use the existing
> <!ELEMENT> declaration syntax, so that we don't
> have any dependency on the XML extension proposal.
> This proposal turns query conditions with type
> mismatches into malformed XML documents. This proposal 
> will continue to work regardless of the outcome of the
> proposal for XML datatype extensions.
> 
> ITEM 1:
> I postulate small, fixed set of interesting base
> datatypes. In this memo, I assume Boolean, integer,
> floating point, strings, and datetimes are the
> interesting base datatypes for sake of discussion.
> 
> NOTE: Even though I think we should represent
> dates as strings in ISO 8061 format using the Zulu
> time zone, I made datetimes their own datatype,
> so that we can have different operators on them,
> e.g., "today's date", etc., and so that you know
> you want to run the date string through a format
> routine before displaying it, etc.  The fact
> that SQL also makes datetimes their own type
> is worthy of note, and bolsters this argument. END NOTE
> 
> ITEM 2:
> There are at most three things that can return a
> value of a given datatype: (1) a property,
> (2) a literal, and (3) an operator. Therefore, there is
> an element type definition for expressions of each
> base datatype, and the alternatives of this element
> definition are (1) properties with that datatype,
> (2) literals of that datatype, and (3) operators
> returning that datatype (unless no operators
> are currently defined that return that datatype).
> 
> ITEM 3:
> The "where" clause consists of an operator that 
> returns a Boolean value.
> 
> ITEM 4:
> To add a new operator in the future that returns results
> of datatype x, (1) make an element definition, x_op
> for all operators that return values of datatype x
> if there isn't one already,
> (2) make an element definition for the new operator
> to show what its operands need to be, and
> add it to list of the alternatives of x_op,
> and (3) if there isn't already an alternative for x_op
> in the element definition of x_expr, add x_op
> to its list of alternatives. The x_op and x_expr
> forms are just naming conventions, and have no other
> significance. In particular, they are not
> parsed to discover "x".
> 
> Just for the heck of it, I've included a string
> length operator, "strlen" that takes a string
> argument and returns an integer, and a floating
> point addition operator, just to illustrate
> how the model extends -- not to propose them
> as part of the minimal required operator set.
> 
> 
> Here are the definitions for the query "where" condition:
> 
> <!ELEMENT where ( boolean_op ) >
> 
> <!-- These are all the operators that return a Boolean value -->
> <!ELEMENT boolean_op ( and | or | not | 
>                        gt_integer | ge_integer | 
>                        eq_integer | ne_integer | 
>                        ls_integer | le_integer |
>                        gt_float | ge_float |
>                        eq_float | ne_float |
>                        ls_float | le_float |
>                        gt_string | ge_string |
>                        eq_string | nq_string |
>                        ls_string | le_string |
>                        gt_datetime | ge_datetime |
>                        eq_datetime | ne_datetime |
>                        ls_datetime | le_datetime |
>                        contains ) >
> 
> <!ELEMENT and ( boolean_expr , boolean_expr+ ) >
> <!ELEMENT or ( boolean_expr , boolean_expr+ ) >
> <!ELEMENT not ( boolean_expr ) >
> <!ELEMENT gt_integer ( int_expr , int_expr ) >
> <!ELEMENT ge_integer ( int_expr , int_expr ) >
> <!ELEMENT eq_integer ( int_expr , int_expr ) >
> <!ELEMENT ls_integer ( int_expr , int_expr ) >
> <!ELEMENT le_integer ( int_expr , int_expr ) >
> <!ELEMENT gt_float ( float_expr , float_expr ) >
> <!ELEMENT ge_float ( float_expr , float_expr ) >
> <!ELEMENT eq_float ( float_expr , float_expr ) >
> <!ELEMENT ne_float ( float_expr , float_expr ) >
> <!ELEMENT ls_float ( float_expr , float_expr ) >
> <!ELEMENT le_float ( float_expr , float_expr ) >
> <!ELEMENT gt_string ( string_expr , string_expr ) >
> <!ELEMENT ge_string ( string_expr , string_expr ) >
> <!ELEMENT eq_string ( string_expr , string_expr ) >
> <!ELEMENT ne_string ( string_expr , string_expr ) >
> <!ELEMENT ls_string ( string_expr , string_expr ) >
> <!ELEMENT le_string ( string_expr , string_expr ) >
> <!ELEMENT gt_datetime ( datetime_expr , datetime_expr ) >
> <!ELEMENT ge_datetime ( datetime_expr , datetime_expr ) >
> <!ELEMENT eq_datetime ( datetime_expr , datetime_expr ) >
> <!ELEMENT ne_datetime ( datetime_expr , datetime_expr ) >
> <!ELEMENT ls_datetime ( datetime_expr , datetime_expr ) >
> <!ELEMENT le_datetime ( datetime_expr , datetime_expr ) >
> <!ELEMENT contains ( prop? , #PCDATA ) >
> 
> <!-- These are all the operators that return an integer value -->
> <!ELEMENT int_op ( strlen ) >
> 
> <!ELEMENT strlen ( #PCDATA ) >
> 
> <!-- These are all the operators that return a floating point value
> -->
> <!ELEMENT float_op ( add_float ) >
> 
> <!ELEMENT add_float ( float_expr , float_expr+ ) >
> 
> <!-- No operators are currently defined that return string
>      or datetime values.
> -->
> 
> <!-- The following elements define all Boolean valued expressions -->
> <!ELEMENT boolean_expr ( boolean_op | boolean_prop | boolean_const ) >
> <!ELEMENT boolean_prop ( prop ) >
> <!ELEMENT boolean_const ( TRUE , FALSE, UNKNOWN ) >
> 
> <!-- The following elements define all integer valued expressions -->
> <!ELEMENT int_expr ( int_op | int_prop | int_const ) >
> <!ELEMENT int_prop ( prop )
> <!ELEMENT int_const ( #PCDATA )
> 
> <!-- The following elements define all floating point valued
> expressions
> -->
> <!ELEMENT float_expr ( float_op | float_prop | float_const ) >
> <!ELEMENT float_prop ( prop ) >
> <!ELEMENT float_const ( #PCDATA ) >
> 
> <!-- The following elements define all string valued expressions -->
> <!ELEMENT string_expr ( string_prop | string_const ) >
> <!ELEMENT string_prop ( prop ) >
> <!ELEMENT string_const ( #PCDATA ) >
> 
> <!-- The following elements define all datetime valued expressions -->
> <!ELEMENT datetime_expr ( datetime_prop | datetime_const ) >
> <!ELEMENT datetime_prop ( prop ) >
> <!ELEMENT datetime_const ( #PCDATA ) >
> 
> 
> As an example, the query condition
> 
>       ( loan_processor = "Sam" AND loan_amount > 100000 ) OR
>       loan_risk_level > 3
> 
> We would have a parse tree that looks like this:
> 
>     OR
>         AND
>             =
>                 loan_processor
>                 "Sam"
>             >
>                 loan_amount
>                 100000
>         >
>             loan_risk_level
>             3
> 
> and XML that looks like this:
> 
> <where>
>     <boolean_op>
>         <or>
>             <boolean_op>
>                 <and>
>                     <boolean_op>
>                         <eq_string>
>                             <string_prop>
>                                 loan_processor
>                             </sring_prop>
>                             <string_const>Sam</string_const>
>                         </eq_string>
>                     </boolean_op>
>                     <boolean_op>
>                         <gt_integer>
>                             <int_prop>
>                                 loan_amount
>                             </int_prop>
>                             <int_const>1000000</int_const>
>                         </gt_integer>
>                     </boolean_op>
>                 </and>
>             </boolean_op>
>             <boolean_op>
>                 <gt_integer>
>                     <int_prop>
>                         loan_risk_level
>                     </int_prop>
>                     <int_const>3</int_const>
>                 </gt_integer>
>             </boolean_op>
>         </or>
>     </boolean_op>
> </where>      
> 
> I'm not familiar enough with XML to know whether
> we can improve on the definition of properties or constant
> types. At the bottom level I just used "prop", 
> which is defined somewhere else,
> and #PCDATA, which seems like it needs further
> syntactic definition.
> 
> What do you think?
> 
> I'll be out of town at least until Thursday, 4/23/98,
> and I'm already behind on my e-mail. I haven't even
> read all the DASL e-mail yet. So, I won't be able
> to respond before 4/23 to comments.
> 
> Alan Babich
Received on Wednesday, 29 April 1998 20:25:47 UTC