Order of evaluation for aggregates from Birte Glimm on 2011-11-10 (public-rdf-dawg@w3.org from October to December 2011)

From: Birte Glimm <birte.glimm@uni-ulm.de>
Date: Thu, 10 Nov 2011 11:59:37 +0100
To: SPARQL Working Group <public-rdf-dawg@w3.org>, Andy Seaborne <andy.seaborne@epimorphics.com>, Steve Harris <steve.harris@garlik.com>
Message-ID: <CABt65OdTybT55hLn8_71f+L_cgv_BE5F56cYoedCjY+OBr=C3Q@mail.gmail.com>

Steve, Andy, all,

I am trying to address AM-1, which is the question about evaluation
order for queries with aggregates. There is indeed inconsistency in
the spec regarding this and I believe we further miss a step in the
aggregate evaluation.

We first translate the query pattern, which works ok I believe and
results in some algebra object for the query pattern, say P. If the
query has aggregates, the solutions obtained by evaluating P have to
be grouped. The GROUP BY clause can itself contain expressions, e.g.,
GROUP BY ((?x + ?y) AS ?z). Thus, we need an Extend algebra object
before we can group, which we do not consider in the current algebra
translation.
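
To illustrate why the Extend has to come before the grouping, here is a minimal Python sketch for GROUP BY ((?x + ?y) AS ?z). Solution mappings are modelled as plain dicts; the helper names extend and group and the sample data are assumptions made for this illustration, not definitions from the spec.

```python
def extend(solutions, var, expr):
    """Bind var to expr(mu) in every solution mapping mu.

    If the expression cannot be evaluated (here: a missing variable or a
    type mismatch), the mapping is kept and var is simply left unbound.
    """
    out = []
    for mu in solutions:
        nu = dict(mu)
        try:
            nu[var] = expr(mu)
        except (KeyError, TypeError):
            pass  # expression error: leave var unbound
        out.append(nu)
    return out


def group(solutions, key_vars):
    """Partition solution mappings by the values of key_vars."""
    groups = {}
    for mu in solutions:
        key = tuple(mu.get(v) for v in key_vars)
        groups.setdefault(key, []).append(mu)
    return groups


# Hypothetical solutions of the query pattern P.
solutions = [{"x": 1, "y": 2}, {"x": 2, "y": 1}, {"x": 5, "y": 5}]

# ?z must exist in the solutions before we can group on it.
extended = extend(solutions, "z", lambda mu: mu["x"] + mu["y"])
groups = group(extended, ["z"])
```

Grouping on "z" without the preceding extend would put all three solutions into a single group with an unbound key, which is why the translation needs the extra step.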

Once we have extended the solutions with the group by clause
expressions, we can group and then form the aggregate. One problem here
is that when we select a grouped variable, this variable is currently
replaced by a SAMPLE aggregate without assigning the aggregated value
back to the variable, so we get ?agg_i variables in the results. E.g.,
from
SELECT ?x (AVG(?y) AS ?z) { ... } GROUP BY ?x
we first get
P=Group((?x), bgp(...))
and the rewritten query
SELECT SAMPLE(?x) (AVG(?y) AS ?z) { ... } GROUP BY ?x
we then get
P=AggregateJoin(
  Aggregation((?x), SAMPLE, {}, P),
  Aggregation((?y), AVG, {}, P)
)
and the rewritten query
SELECT ?agg_1 (?agg_2 AS ?z) { ... } GROUP BY ?x
We then get
Extend(P, ?z, ?agg_2)
?agg_1, however, remains as there is no AS ... in the SELECT clause.

This is not nice and might even lead to problems when I use ?x in the
HAVING clause, as ?x is no longer bound by any solution mapping.

I see two possibilities to fix that:
1) Replace a non-aggregated variable ?x with (SAMPLE(?x) AS ?x)
2) There is not really a need to aggregate these variables at all, as
they are unique for each group (since only grouped variables can be
selected). Thus, AggregateJoin(...) could be extended to construct the
solution mappings by including each grouped variable with its value
(as in the key for the aggregate) plus the aggregated values.
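
Possibility 2 can be sketched roughly as follows in Python. The function aggregate_join below is a hypothetical stand-in for an extended AggregateJoin, not the spec's definition; groups are modelled as a dict from key tuples to lists of solution mappings, and the sample data is invented.

```python
from statistics import mean


def group(solutions, key_vars):
    """Partition solution mappings (dicts) by the values of key_vars."""
    groups = {}
    for mu in solutions:
        key = tuple(mu.get(v) for v in key_vars)
        groups.setdefault(key, []).append(mu)
    return groups


def aggregate_join(groups, key_vars, aggregates):
    """Build one solution mapping per group.

    aggregates maps an internal variable name (e.g. "agg_1") to a pair
    (function, source variable). Each grouped variable is copied from the
    group key under its own name, so no SAMPLE rewrite is needed for it.
    """
    rows = []
    for key, members in groups.items():
        # Grouped variables keep their names; unbound key parts are skipped.
        row = {v: k for v, k in zip(key_vars, key) if k is not None}
        for agg_var, (func, src) in aggregates.items():
            row[agg_var] = func([m[src] for m in members if src in m])
        rows.append(row)
    return rows


# SELECT ?x (AVG(?y) AS ?z) { ... } GROUP BY ?x, with hypothetical data.
solutions = [{"x": "a", "y": 1}, {"x": "a", "y": 3}, {"x": "b", "y": 10}]
rows = aggregate_join(group(solutions, ["x"]), ["x"], {"agg_1": (mean, "y")})
```

With this shape, ?x is still mapped in every resulting solution, so a later HAVING or SELECT expression can refer to it directly.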


Thus, I believe, we have to have the following order in the algebra translation:
1) query pattern translation as normal
2) Translate GROUP BY expressions -> Extend(...)
3) Group
4) Aggregate (fixed to properly handle non-aggregated variables)
5) Extend
6) Filter (from HAVING)
Filter can only work after we have properly assigned the variables in step 5.

In the current algebra transformation we omit 2), and have 5) and 6) swapped.
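
The proposed order can be sketched end to end in Python. This is only an illustration of the six steps listed above under simplifying assumptions: solution mappings are dicts, all helper names are hypothetical, and the example query (grouping on a lower-cased ?x, with a HAVING condition on ?z) is invented to show why the Filter needs the variables assigned in step 5.

```python
from statistics import mean


def extend(solutions, var, expr):
    """Bind var to expr(mu) in every solution mapping (no error handling)."""
    return [{**mu, var: expr(mu)} for mu in solutions]


def group(solutions, key_vars):
    """Partition solution mappings by the values of key_vars."""
    groups = {}
    for mu in solutions:
        groups.setdefault(tuple(mu[v] for v in key_vars), []).append(mu)
    return groups


def aggregate(groups, key_vars, aggregates):
    """One mapping per group: grouped variables plus aggregated values."""
    rows = []
    for key, members in groups.items():
        row = dict(zip(key_vars, key))
        for agg_var, (func, src) in aggregates.items():
            row[agg_var] = func([m[src] for m in members])
        rows.append(row)
    return rows


def evaluate(pattern_solutions):
    # 1) pattern_solutions is the result of the normal pattern translation.
    # 2) GROUP BY expression -> Extend: (LCASE(?x) AS ?k)
    step2 = extend(pattern_solutions, "k", lambda mu: mu["x"].lower())
    # 3) Group
    step3 = group(step2, ["k"])
    # 4) Aggregate, keeping the grouped variable ?k bound
    step4 = aggregate(step3, ["k"], {"agg_1": (mean, "y")})
    # 5) Extend: (?agg_1 AS ?z)
    step5 = extend(step4, "z", lambda mu: mu["agg_1"])
    # 6) Filter from HAVING; ?z is only bound because step 5 ran first
    return [mu for mu in step5 if mu["z"] > 2]


result = evaluate([{"x": "A", "y": 1}, {"x": "a", "y": 2}, {"x": "B", "y": 10}])
```

Swapping steps 5 and 6 in this sketch raises a KeyError on "z", which mirrors the problem with the current ordering. The internal ?agg_1 binding is left in the result here; a final projection would drop it.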

Birte


-- 
Jun. Prof. Dr. Birte Glimm            Tel.:    +49 731 50 24125
Inst. of Artificial Intelligence         Secr:  +49 731 50 24258
University of Ulm                         Fax:   +49 731 50 24188
D-89069 Ulm                               birte.glimm@uni-ulm.de
Germany

Received on Thursday, 10 November 2011 11:00:08 UTC