- From: Jonathan Robie <jonathan.robie@datadirect-technologies.com>
- Date: Fri, 07 Mar 2003 20:39:03 -0500
- To: Jim Melton <jim.melton@acm.org>, Nitin Mangtani <nitinm@bea.com>
- Cc: Michael Burbidge <mburbidg@adobe.com>, www-ql@w3.org
At 08:54 AM 3/7/2003 -0700, Jim Melton wrote: >Jonathan Robie (and, to be fair, others) did a considerable amount of work >to try to persuade the WG of the necessity of having a grouping capability. But I withdrew myself from the agenda when I was shown an acceptable (if tedious) way to do grouping with the existing XQuery language. For what it's worth, I think we really need use cases for grouping before we can design this feature right. So far, I have seen a lot of solutions to the grouping problem in XQuery, but the only usage scenarios I have seen were done by Jeni Tennison, and are available on her web site ("http://www.jenitennison.com", I think - I am on an airplane right now and hence offline). We need use cases so that we know what problems we need to solve when we get around to this in Level 2. They should discuss what problem the user is trying to solve, and show the input XML document(s) and the output XML. I can't emphasize enough how useful the use cases in our W3C XML Query Use Cases document have been in designing the language. >Recently, Jonathan made a valiant effort to get a grouping capability into >the language...sort of a last-minute attempt. I was originally convinced that we needed grouping to handle queries that group by multiple fields - at one point, we could do that with distinct-values(), which worked with complex values, but always using deep equality: <results> { for $a in distinct-values(document("http://www.bn.com/bib.xml")//author) return <result> { $a } { for $b in document("http://www.bn.com/bib.xml")/bib/book where some $ba in $b/author satisfies deep-equal($ba,$a) return $b/title } </result> } </results> >Don Chamberlin wrote a message to Jonathan, me, and at least a few others >(perhaps to the entire Query WG?) that illustrated a grouping capability >that is quite useful in certain situations (e.g., grouping by individual >elements and not multiple elements. Right - here's the solution that Don proposed, which is based on a suggestion from Mike Kay: for $last in distinct-values(input()//author/last), $first in distinct-values(input()//author[last=$last]/first) let $books := input()//book [some $ba in $books/author satisfies $ba/last = $last and $ba/first = $first] return <author> <name> <last>{$last}</last> <first>{$first}</first> </name> {$books} </author> That works quite generally - it's a bit fiddly for the query author, and many systems may not optimize it all that well, but plenty good for Version 1.0. It really is time to stop adding new features unless the language is broken without them. >This approach is probably "good enough" for version 1, allowing us to >consider a more complete grouping facility in a future version of the >XQuery specification. I agree. For what it's worth, here are the other options that I presented to the Working Group before I saw Don's proposal. My personal favorite is currently Option 4, but we can worry about that in Level 2. All of these are different from SQL's group by. Others have suggested a grouping facility modelled more directly on that of SQL. ======================================================================== Option 1: Restore complex distinct-values() Obviously, one way of fixing the use case is to restore the previous definition of distinct-values(). Advantages: - straightforward Disadvantages: - deep-equal is fragile, and does not allow subitems to be ignored - concerns about optimizability of distinct-values() - complicated and somewhat weird definition of distinct-values() ======================================================================== Option 2: String hack The Working Group could simply decide to use a string hack to solve this problem in the use cases. I believe the following query solves the use case correctly: define function string-hack($a as element(author))*) as xs:string* { for $aa in $a return "[[last:"+string($a/last)+"]" + "[first:"+string($a/first)+"]]" } <results> { let $in := document("http://www.bn.com/bib.xml")//book for $a in distinct-values(string-hack($in//author)) return <result> { $a } { for $b in document("http://www.bn.com/bib.xml")/bib/book where some $ba in $b/author satisfies string-hack($ba)=string-hack($a) return $b/title } </result> } </results> Advantages: - Gets grouping off our plate Disadvantages: - Complex for users - Hard to optimize ======================================================================== Option 3: With Distinct + Quantifiers Mike Kay has suggested an operator called "with distinct" that returns items from a sequence that are unique with respect to a list of expressions. For instance, the following query returns a sequence of authors in which no two authors have the same first and last names: input()//author with distinct( last, first ) The syntax of this operator is: WithDistinct ::= Expr "with" "distinct" ( Expr ("," Expr)* ) The expressions in parentheses are evaluated for each node from the left-hand expression, using that node as the context node, to produce a set of values. For each distinct set of values, a corresponding node from the left-hand expression is returned. This allows us to solve the use case as follows: let $bin := input()//book for $a in $bin//author with distinct(last, first) let $b := $bin[some $ba in author satisfies ($ba/last=$a/last and $ba/first=$a/first)] return <author> <name>{ $a/last, $a/first }</name>, { $b/title } </author> Advantages: - with distinct "does the right thing" - it's "the simplest thing that works" Disadvantages - matching books is still a little fiddly - perhaps not easily optimizable When I say that matching books is still a little fiddly, note that the following is not a correct solution, for reasons that may not be obvious to all users: let $bin := input()//book for $a in $bin//author with distinct(last, first) let $b := $bin[$ba/last=$a/last and $ba/first=$a/first] return <author> <name>{ $a/last, $a/first }</name>, { $b } </author> ======================================================================== Option 4: With Distinct + Matches Option 3 makes it possible for users to do grouping in a meaningful way, but is still somewhat tricky for users, and does not leave obvious clues for query processors to know how to optimize. For the sample query, with distinct makes it easy to find unique authors, but we might want an easier way to find books that correspond to a given author. We could introduce another postfix operator, "matches", which returns nodes that match a given node by a set of criteria. For instance, if $a is bound to a given author, then the following expression finds authors that match $a by last name and first name: input()//author matches $a by (last, first) This can be used to solve our query as follows: let $bin := input()//book for $a in $bin//author with distinct(last, first) let $b := $bin[author matches $a by (last, first)] return <author> <name>{ $a/last, $a/first }</name>, { $b } </author> The syntax of matches is: Matches ::= Expr "matches" Expr "by" ( Expr ("," Expr)* ) When matches compares two nodes, it evaluates the expressions in the parentheses for each, using that node as the context node. If the left-hand node matches the right-hand node, it is included in the result. Advantages: - with distinct "does the right thing" - easy to explain to users - concise - patterns leave clues for optimizer - each expression is useful in its own right - no change to FLWOR expressions or tuple generation Disadvantages - too much for level 1? ======================================================================== Option 5: Group By Some people prefer explicit group by. In XQuery, I think it is most natural to do this as a separate expression, similar to the FLWOR expression. In the following example, each group is a tuple in which each $a is bound to one distinct author and each $b is bound to the books corresponding to that author: group $b := input()//book by $a in $b//author with distinct (last, first) order by $a/first, $a/last return <author> { <name>{ $a/last, $a/first }</name>, $b/title } </author> Advantages: - concise - easy to optimize Disadvantages - semantics are complex - syntax is hard to explain - similar syntax to SQL group by, but different semantics >It has been asserted that the relational grouping operator (GROUP BY) is >not appropriate for use in an XML query language. I admit that I don't >fully understand the arguments (yet), but I have a sort of intuitive >understanding of at least some of the issues. The most important one is, >of course, that the relational GROUP BY operates strictly on a virtual >table (a regular, rectangular structure with every cell occupied by >*something*, even if the null value), while an XQuery grouping operator >would have to operate on semi-structured tree structures. The challenges >of the latter clear exceed those of the former, even if some lessons of >the former can be learned and used in designing the latter. A lot of this boils down to the following question: in the XML Query Data Model, what is a group? You might answer that a group is defined in the tuple stream of XQuery rather than in the Data Model per se. But even the tuple stream of XQuery is different from that of SQL. Jonathan
Received on Saturday, 8 March 2003 08:13:19 UTC