Implicit vs. Explicit Grouping (Aggregates)

This email discharges my action 
http://www.w3.org/2009/sparql/track/actions/23 in reference to this 
issue: http://www.w3.org/2009/sparql/track/issues/11.

When we surveyed existing approaches to aggregates at the F2F[1], the 
majority of implementations required that groups be explicitly defined. 
These implementations behave as follows:

1) If there is a GROUP BY clause, the solution set is split into 
disjoint partitions (groups) based on distinct combinations of values of 
the GROUP BY variables.

2) If there is no GROUP BY clause, the entire solution set functions as 
a single group.

3) In the projection (SELECT clause), you may project anything that has 
a well-defined value for each group. This means the following:

   a) A scalar expression involving only constants and variables 
mentioned in the GROUP BY clause (the simplest case of this is 
projecting a GROUP BY variable itself)
   b) An aggregate expression

4) In the projection, it is an error to project out anything else, in 
particular a scalar expression involving any variable not explicitly 
listed in the GROUP BY clause.
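
To make the explicit rules concrete, here is a minimal Python sketch of 
points 1 through 4. It is only an illustration, not any implementation's 
actual code: solutions are modelled as plain dicts, and the names 
group_solutions and check_projection, as well as the (is_aggregate, 
variables) encoding of projected expressions, are assumptions of mine.

```python
from collections import defaultdict


def group_solutions(solutions, group_vars):
    """Partition solutions (dicts of variable -> value) into disjoint groups."""
    if not group_vars:
        # No GROUP BY clause: the entire solution set is a single group.
        return {(): list(solutions)}
    groups = defaultdict(list)
    for sol in solutions:
        # One group per distinct combination of GROUP BY variable values.
        key = tuple(sol.get(v) for v in group_vars)
        groups[key].append(sol)
    return dict(groups)


def check_projection(projection, group_vars):
    """Reject scalar expressions that use a variable not listed in GROUP BY.

    Each projected expression is modelled (hypothetically) as a pair
    (is_aggregate, variables_used).
    """
    allowed = set(group_vars)
    for is_aggregate, variables in projection:
        extra = set(variables) - allowed
        if not is_aggregate and extra:
            raise ValueError(f"not in GROUP BY: {sorted(extra)}")


solutions = [
    {"author": "a1", "book": "b1", "price": 10},
    {"author": "a1", "book": "b2", "price": 5},
    {"author": "a2", "book": "b3", "price": 7},
]
by_author = group_solutions(solutions, ["author"])
single_group = group_solutions(solutions, [])
```

Under these assumptions, projecting the author variable together with an 
aggregate over price passes the check, while projecting the book variable 
as a scalar raises an error.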

It was mentioned that Virtuoso has implicit grouping. From the Virtuoso 
documentation[2], this design means:

1) There is no GROUP BY clause. Groups are always determined implicitly.

2) The grouping variables are determined by looking at the projected 
expressions. All variables mentioned in the projection (SELECT clause) 
that are _not_ part of an aggregate expression are considered as part of 
the grouping key.

3) If all projected expressions are aggregates, then the entire solution 
set functions as a single group.

4) Because the projection implicitly determines the groups, there are no 
error conditions.
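
As a rough illustration of the implicit design, the following Python 
sketch derives the grouping key from the projection alone. It is my own 
reading of the points above, not Virtuoso's code; the function name 
implicit_group_vars and the (is_aggregate, variables) encoding of 
projected expressions are hypothetical.

```python
def implicit_group_vars(projection):
    """Return the grouping key implied by a projection.

    Each projected expression is modelled (hypothetically) as a pair
    (is_aggregate, variables_used). Variables that appear outside an
    aggregate expression become part of the grouping key.
    """
    key = []
    for is_aggregate, variables in projection:
        if is_aggregate:
            continue  # variables inside aggregates do not affect grouping
        for var in variables:
            if var not in key:
                key.append(var)
    return key


# One plain variable and one aggregate: group by the plain variable.
mixed = implicit_group_vars([(False, ["author"]), (True, ["price"])])

# Only aggregates: empty key, so the whole solution set is one group.
all_aggregates = implicit_group_vars([(True, ["price"]), (True, ["book"])])
```

Since every projection yields some key, possibly empty, there is no input 
for which this derivation fails, which matches the absence of error 
conditions noted above.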


I believe that standard SQL is similar to the explicit case described above.

The MySQL documentation[3] states that MySQL extends this behavior so that 
projected variables that are not part of GROUP BY are _not_ an error, 
but instead behave similarly to the SAMPLE aggregate[4] that we've briefly 
discussed in the past.
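
A SAMPLE-like treatment of such variables could be sketched in Python as 
follows. This is only an illustration of the idea of picking one value per 
group, not MySQL's actual algorithm; the name project_with_sample and the 
dict-based group representation are assumptions.

```python
def project_with_sample(groups, group_vars, extra_var):
    """Emit one row per group: the grouping values plus one value of extra_var.

    groups maps a tuple of grouping values to the list of solutions
    (dicts) in that group.
    """
    rows = []
    for key, members in groups.items():
        row = dict(zip(group_vars, key))
        # Any member's value would do; taking the first is an arbitrary choice.
        row[extra_var] = members[0][extra_var]
        rows.append(row)
    return rows


groups = {
    ("a1",): [{"author": "a1", "book": "b1"}, {"author": "a1", "book": "b2"}],
    ("a2",): [{"author": "a2", "book": "b3"}],
}
rows = project_with_sample(groups, ["author"], "book")
```

For the first group, either book would be an acceptable result, so callers 
should not depend on which one is returned.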

Based on the balance of current implementation experience and based on 
the SQL precedent, I'm going to suggest that we resolve ISSUE-11 in 
favor of explicit grouping and in favor of it being an error to project 
variables (or scalar functions on variables) not mentioned in GROUP BY.

Of course, I'll be happy to entertain suggestions to the contrary.

Lee

PS This is, again, how I'd like to see us proceed on issues - summary / 
proposal discussion / telecon discussion only if necessary / announced 
proposed resolution / telecon decision. During that process, proposals 
can and should also be worked out on the wiki, of course. Since we're 
dealing with UPDATE on this week's teleconference, the earliest we'd 
consider resolving this issue would be the following week - the agenda 
will list any proposed resolutions that the chairs intend to put to the 
group on a teleconference.



[1] http://www.w3.org/2009/sparql/meeting/2009-05-07#aggregates
[2] http://docs.openlinksw.com/virtuoso/rdfsparqlaggregate.html
[3] http://dev.mysql.com/doc/refman/6.0/en/group-by-hidden-columns.html
[4] http://www.w3.org/2009/sparql/wiki/Feature:SampleAggregate

Received on Tuesday, 26 May 2009 00:25:56 UTC