Re: XQuery and relational databases... from Jonathan Robie on 2003-03-08 (www-ql@w3.org from January to March 2003)

From: Jonathan Robie <jonathan.robie@datadirect-technologies.com>
Date: Fri, 07 Mar 2003 20:39:03 -0500
To: Jim Melton <jim.melton@acm.org>, Nitin Mangtani <nitinm@bea.com>
Cc: Michael Burbidge <mburbidg@adobe.com>, www-ql@w3.org
Message-Id: <5.2.0.9.0.20030307201918.05234eb0@ncmail.datadirect-technologies.com>
At 08:54 AM 3/7/2003 -0700, Jim Melton wrote:

>Jonathan Robie (and, to be fair, others) did a considerable amount of work 
>to try to persuade the WG of the necessity of having a grouping capability.

But I withdrew myself from the agenda when I was shown an acceptable (if 
tedious) way to do grouping with the existing XQuery language. For what 
it's worth, I think we really need use cases for grouping before we can 
design this feature right. So far, I have seen a lot of solutions to the 
grouping problem in XQuery, but the only usage scenarios I have seen were 
done by Jeni Tennison, and are available on her web site 
("http://www.jenitennison.com", I think - I am on an airplane right now and 
hence offline).

We need use cases so that we know what problems we need to solve when we 
get around to this in Level 2. They should discuss what problem the user is 
trying to solve, and show the input XML document(s) and the output XML. I 
can't emphasize enough how useful the use cases in our W3C XML Query Use 
Cases document have been in designing the language.

>Recently, Jonathan made a valiant effort to get a grouping capability into 
>the language...sort of a last-minute attempt.

I was originally convinced that we needed grouping to handle queries that 
group by multiple fields - at one point, we could do that with 
distinct-values(), which worked with complex values, but always using deep 
equality:

<results>
  {
   for $a in distinct-values(document("http://www.bn.com/bib.xml")//author)
   return
     <result>
       { $a }
       {
         for $b in document("http://www.bn.com/bib.xml")/bib/book
         where some $ba in $b/author satisfies deep-equal($ba,$a)
         return $b/title
       }
     </result>
  }
</results>

>Don Chamberlin wrote a message to Jonathan, me, and at least a few others 
>(perhaps to the entire Query WG?) that illustrated a grouping capability 
>that is quite useful in certain situations (e.g., grouping by individual 
>elements and not multiple elements.

Right - here's the solution that Don proposed, which is based on a 
suggestion from Mike Kay:

for $last in distinct-values(input()//author/last),
   $first in distinct-values(input()//author[last=$last]/first)
let $books := input()//book
      [some $ba in $books/author satisfies
        $ba/last = $last and $ba/first = $first]
return
   <author>
      <name>
        <last>{$last}</last>
        <first>{$first}</first>
      </name>
      {$books}
   </author>

That works quite generally - it's a  bit fiddly for the query author, and 
many systems may not optimize it all that well, but plenty good for Version 
1.0. It really is time to stop adding new features unless the language is 
broken without them.

>This approach is probably "good enough" for version 1, allowing us to 
>consider a more complete grouping facility in a future version of the 
>XQuery specification.

I agree. For what it's worth, here are the other options that I presented 
to the Working Group before I saw Don's proposal. My personal favorite is 
currently Option 4, but we can worry about that in Level 2. All of these 
are different from SQL's group by. Others have suggested a grouping 
facility modelled more directly on that of SQL.

========================================================================
Option 1: Restore complex distinct-values()

Obviously, one way of fixing the use case is to restore the previous
definition of distinct-values().

Advantages:
- straightforward

Disadvantages:

- deep-equal is fragile, and does not allow subitems to be ignored
- concerns about optimizability of distinct-values()
- complicated and somewhat weird definition of distinct-values()

========================================================================
Option 2: String hack

The Working Group could simply decide to use a string hack to solve
this problem in the use cases. I believe the following query solves
the use case correctly:

define function string-hack($a as element(author))*)
  as xs:string*
{
  for $aa in $a
  return
    "[[last:"+string($a/last)+"]" + "[first:"+string($a/first)+"]]"
}

<results>
  {
   let $in := document("http://www.bn.com/bib.xml")//book
   for $a in distinct-values(string-hack($in//author))
   return
     <result>
       { $a }
       {
         for $b in document("http://www.bn.com/bib.xml")/bib/book
         where some $ba in $b/author satisfies string-hack($ba)=string-hack($a)
         return $b/title
       }
     </result>
  }
</results>

Advantages:
- Gets grouping off our plate

Disadvantages:
- Complex for users
- Hard to optimize


========================================================================
Option 3: With Distinct + Quantifiers

Mike Kay has suggested an operator called "with distinct" that returns
items from a sequence that are unique with respect to a list of
expressions. For instance, the following query returns a sequence of
authors in which no two authors have the same first and last names:

  input()//author with distinct( last, first )

The syntax of this operator is:

WithDistinct ::= Expr "with" "distinct" ( Expr ("," Expr)* )

The expressions in parentheses are evaluated for each node from the
left-hand expression, using that node as the context node, to produce
a set of values. For each distinct set of values, a corresponding node
from the left-hand expression is returned. This allows us to solve the
use case as follows:

  let $bin := input()//book
  for $a in $bin//author with distinct(last, first)
  let $b := $bin[some $ba in author satisfies
                 ($ba/last=$a/last and $ba/first=$a/first)]
  return
     <author>
        <name>{ $a/last, $a/first }</name>,
        { $b/title }
     </author>

Advantages:
- with distinct "does the right thing"
- it's "the simplest thing that works"

Disadvantages
- matching books is still a little fiddly
- perhaps not easily optimizable

When I say that matching books is still a little fiddly, note that the
following is not a correct solution, for reasons that may not be
obvious to all users:

  let $bin := input()//book
  for $a in $bin//author with distinct(last, first)
  let $b := $bin[$ba/last=$a/last and $ba/first=$a/first]
  return
     <author>
        <name>{ $a/last, $a/first }</name>,
        { $b }
     </author>

========================================================================

Option 4: With Distinct + Matches

Option 3 makes it possible for users to do grouping in a meaningful
way, but is still somewhat tricky for users, and does not leave
obvious clues for query processors to know how to optimize. For the
sample query, with distinct makes it easy to find unique authors, but
we might want an easier way to find books that correspond to a given
author.

We could introduce another postfix operator, "matches", which returns
nodes that match a given node by a set of criteria. For instance, if
$a is bound to a given author, then the following expression finds
authors that match $a by last name and first name:

     input()//author matches $a by (last, first)

This can be used to solve our query as follows:

  let $bin := input()//book
  for $a in $bin//author with distinct(last, first)
  let $b := $bin[author matches $a by (last, first)]
  return
     <author>
        <name>{ $a/last, $a/first }</name>,
        { $b }
     </author>

The syntax of matches is:

Matches ::= Expr "matches" Expr "by" ( Expr ("," Expr)* )

When matches compares two nodes, it evaluates the expressions in the
parentheses for each, using that node as the context node. If the
left-hand node matches the right-hand node, it is included in the
result.

Advantages:
- with distinct "does the right thing"
- easy to explain to users
- concise
- patterns leave clues for optimizer
- each expression is useful in its own right
- no change to FLWOR expressions or tuple generation

Disadvantages
- too much for level 1?

========================================================================
Option 5: Group By

Some people prefer explicit group by. In XQuery, I think it is most
natural to do this as a separate expression, similar to the FLWOR
expression. In the following example, each group is a tuple in which
each $a is bound to one distinct author and each $b is bound to the
books corresponding to that author:

group $b := input()//book
    by $a in $b//author with distinct (last, first)
order by $a/first, $a/last
return
    <author>
     {
       <name>{ $a/last, $a/first }</name>,
       $b/title
     }
    </author>

Advantages:
- concise
- easy to optimize

Disadvantages
- semantics are complex
- syntax is hard to explain
- similar syntax to SQL group by, but different semantics

>It has been asserted that the relational grouping operator (GROUP BY) is 
>not appropriate for use in an XML query language.  I admit that I don't 
>fully understand the arguments (yet), but I have a sort of intuitive 
>understanding of at least some of the issues.  The most important one is, 
>of course, that the relational GROUP BY operates strictly on a virtual 
>table (a regular, rectangular structure with every cell occupied by 
>*something*, even if the null value), while an XQuery grouping operator 
>would have to operate on semi-structured tree structures.  The challenges 
>of the latter clear exceed those of the former, even if some lessons of 
>the former can be learned and used in designing the latter.

A lot of this boils down to the following question: in the XML Query Data 
Model, what is a group? You might answer that a group is defined in the 
tuple stream of XQuery rather than in the Data Model per se. But even the 
tuple stream of XQuery is different from that of SQL.

Jonathan
Received on Saturday, 8 March 2003 08:13:19 UTC