Re: [xml-dev] The use of XML syntax in XML Query

At 08:04 AM 1/4/2002 -0800, Dare Obasanjo wrote:
>----- Original Message -----
>From: "Jonathan Robie" <jonathan.robie@softwareag.com>
>To: "Dare Obasanjo" <kpako@yahoo.com>; "Champion, Mike"
><Mike.Champion@SoftwareAG-USA.com>; <www-xml-query-comments@w3.org>;
><xml-dev@lists.xml.org>
>Cc: <suciu@cs.washington.edu>
>Sent: Thursday, January 03, 2002 4:26 PM
>Subject: Re: [xml-dev] The use of XML syntax in XML Query
>
> > 2. Typechecking without type inference can work fine as long as you do not
> > have joins. If joins are involved, typechecking becomes undecidable.
>
>So what happens with XQuery joins ( http://www.w3.org/TR/xquery/#id-joins )
>then? Will there be a caveat in the recommendation that indicates that static
>typechecking can't be done w.r.t joins?

What this means is that we need to use type inference for static 
typechecking in XQuery, because without it, typechecking becomes 
undecideable. Fortunately, type inference is precisely the approach we take.

> > 3. Type inference is the most promising approach, but it does lead to some
> > false negatives. He gives an example using a content model involving equal
> > cardinality among three different elements in sequence. The schema for this
> > can not be expressed in DTDs or in XML Schema, so it is not clear to me
> > that this is a real limitation, but I just read this, and I need to do some
> > thinking before I would want to draw a strong conclusion based on his
> > examples.
>
>But his example is fairly simple and utilizes none of the really complex
>abilities of XML schemas.

His example involves

<result>
{
   for $x in document("doc.xml")//elm return <a>{ $x/text() }</a>
   for $x in document("doc.xml")//elm return <b>{ $x/text() }</b>
   for $x in document("doc.xml")//elm return <c>{ $x/text() }</c>
}
</result>

In most current systems, the inferred content model of the result element is:

a*, b*, c*

Now suppose the DTD we require for our output is this:

(((a,a)*, (b,b)*, (c,c)*) | ((a,a)*,a, (b,b)*,b, (c,c)*,c))

In this content model, either a,b, and c all occur an even number of times, 
or they all occur an odd number of times. Dan points out that the inferred 
content model is not a subset of the required content model, so static type 
checking would not be able to guarantee the required output. Incidentally, 
you don't need as complicated a required type as Dan uses to get this 
result, the following would suffice:

((a,a), (b,b), (c,c))

Here's my take on this:

First off, we know that static typechecking is conservative. If a query 
fails static typechecking, that does not mean it will not generate a 
correct result, it means only that it can not be guaranteed to generate a 
correct result. That's one of the reasons that static typechecking is 
optional in XQuery. In this case, a human being can read the query and see 
that a more precise type could have been inferred, and this more precise 
type could have been guaranteed. One of the things we spend time on in the 
XQuery Working Group is examining the kinds of type information that might 
be computed for an expression, and trying to make it as precise as 
possible. There will almost certainly be cases like the one Dan points out 
where our static typechecking is less precise than you would prefer. The 
example he gives is rather artificial. He says that further work is needed 
to see whether false negatives like this will turn out to be a practical 
problem - in other words, will they be so frequent that people will ignore 
the useful results of static typechecking?

Looking at the way most schemas and DTDs are written, I don't think that 
Dan's example is going to be so frequent that it would cause real problems, 
and the type inference associated with most expressions makes sense to me. 
But the best way to answer Dan's question, IMHO, is to get more 
implementation experience with the static type system, and there are a few 
more wrinkles that I would like to see fixed in the very near future. 
People need to play with real queries with real DTDs and schemas and see 
how useful the static type checking is for them in practice. The theorists 
have spent a lot of time creating this system, but we have not had much of 
a chance to kick the tires. I have spent a little time kicking the tires by 
looking at the queries I write and the DTDs or schemas associated with 
them, and my initial impression is that static typechecking will be very 
useful.

>For instance, how can static typechecking work for
>schemas that use identity constraints like xs:unique? I fail to see how one
>can guarantee that the following expression
>
>f = <results>
>FOR $x in /employee/age
>RETURN
>   <employee-id>{
>    $x * 13
>   }</employee-id>
></results>
>
>will only return unique values for <employee-id> via type inference.

Static typechecking involves structure, not values. Uniqueness constraints 
are defined in terms of values. Static typechecking also ignores the facets 
of simple types in XML Schema, which can also result in errors not detected 
by the static type system. But practical static type systems that also take 
values into account are well beyond the state of the art.

> > 4. Dan concludes that type inference is still the most promising approach
> > to static typechecking, but that further work needs to be done on its
> > applicability and limitations.
> >
> > My own take on this is that static type checking using type inference is
> > very promising, and seems to work well in theory for the kinds of queries I
> > have looked at. An implementation using the current type system did catch
> > interesting errors for me. I think our type system needs further work, and
> > we need more practical experience using implementations that do static type
> > checking. This is one of the highest priorities for me personally.
>
>I'm interested in the type of queries you've looked at. I'm not convinced that
>this is as straightforward a problem as you've implied but readily admit that
>my theoretical CS skills are nowhere near excellent so you may be right and
>all I need is a little convincing.

1. Mismatches in content models

Suppose the input DTD uses the following definition of author:

         <!ELEMENT author ( #PCDATA)>

The output DTD uses the following definition of author:

         <!ELEMENT author (first, last)>

The following query does not validate according to the output DTD, and the 
static type system will catch this:

         //author

Any number of common errors can result in mismatches in content models - 
mistyping the name of an element or forgetting which name is used for it in 
the DTD or schema, putting elements in the wrong order - basically, the 
same kinds of errors you typically get when validating with a DTD or 
schema, except that the static type system can not detect things related to 
values such as identity constraints or facets of XML Schema simple types.

2. Depending on optional content

Suppose we have the following employee element in our input DTD:

         <!ELEMENT employee (name, salary, percent-bonus?)>

Our output DTD shows the computed bonus for each employee, which may be 
zero, but must be present:

         <!ELEMENT employee (name, bonus)>

Now consider the following query:

         for $e in //employee
         return
           <employee>
            {
              $e/name,
                $e/salary * $e/percent-bonus
            }
           </employee>

Oops, that has the wrong structure, because salary*bonus evaluates to a 
value, not to an element. Fortunately, as we showed in (1), the static type 
system catches this error. Let's try again:

         for $e in //employee
         return
           <employee>
            {
              $e/name,
                <bonus>{  $e/salary * $e/percent-bonus }</bonus>
            }
           </employee>

Suppose we have a schema, rather than a DTD, and bonus is defined as a 
decimal value which may not be null. In XQuery, if percent-bonus evaluates 
to an empty sequence, then the product of salary and the empty sequence is 
an empty sequence. For any employee that did not have a percent-bonus in 
the input, there is no bonus in the output, so this does not pass the 
static type check. Let's fix the query using if-empty() to supply a default 
value of zero for percent-bonus:

         for $e in //employee
         return
           <employee>
            {
              $e/name,
                <bonus>{  $e/salary * if-empty($e/percent-bonus, 0) }</bonus>
            }
           </employee>

That now passes static type checking. Incidentally, note that it is quite 
possible that we would have missed the second error in our testing, because 
our test files might have supplied percent-bonus for all employees. A 
customer site, using data corresponding to the same DTD, could have 
encountered the error we missed. Static type checking caught the error 
without the need for an adequate test to expose this dependency.


3. Forgetting the curly braces

Suppose I forgot the curly braces in the above example:

         <bonus> $e/salary * if-empty($e/percent-bonus, 0) </bonus>

This element constructor has string content, which does not match the type 
of a bonus element in my schema.

4. Confusing a URL with the resource to which it points

This is an error I made several times while developing the RDF library for 
the Syntactic Web paper. Some of my functions returned URLs, others 
returned RDF descriptions, and sometimes I changed the return type of a 
function without remembering to change all function calls using that 
function. Since a URL and a resource have different structure, static type 
checking catches this error.

I hope this gives the basic flavor. I think it this is a very useful degree 
of error checking, and these are errors that our current static type system 
already reports.

Jonathan

Received on Friday, 4 January 2002 10:32:43 UTC