# Re: Order of evaluation for aggregates

From: Steve Harris <steve.harris@garlik.com>
Date: Tue, 22 Nov 2011 12:50:00 +0000
Cc: Andy Seaborne <andy.seaborne@epimorphics.com>, SPARQL Working Group <public-rdf-dawg@w3.org>
Message-Id: <88FCC533-EF4D-4CAF-8945-4256EF8FDDCD@garlik.com>

```OK, I think I have a solution, which bridges the gap, and doesn't violate any scope rules:

For each expression E in SELECT and each HAVING(E) in Q

If E does not contain an aggregate
Replace E with Sample(E) in Q
End

For each variable V appearing outside of an aggregate
Ai := Aggregation(V, Sample, {}, G)
Replace V with aggi in E
Q := Extend(Q, V, aggi)
i := i + 1
End

For each aggregate X(args ; scalarvals) now in E
# note scalarvals may be omitted, then it's equivalent to the empty set
Ai := Aggregation(args, X, scalarvals, G)
Replace X(...) with aggi in Q
i := i + 1
End

End

So, SELECT ?x {?x :p ?v } GROUP BY ?x

Becomes

A=AggregateJoin(Aggregation((?x), SAMPLE, {}, A))

rewritten query:
SELECT ?agg1 {?x :p ?v } GROUP BY ?x

The the result is
Extend(Project(A, {?agg1}), ?x, ?agg1)

- Steve

On 2011-11-21, at 21:26, Birte Glimm wrote:

> On 21 November 2011 15:28, Andy Seaborne <andy.seaborne@epimorphics.com> wrote:
>> Steve,
>>
>> Could you walk through the translation of:
>>
>> SELECT ?x {?x :p ?v } GROUP BY ?x
>
> Up to aggregation you get:
> A=Group((?x), Bgp(?x :p ?v))
> rewritten query:
> SELECT SAMPLE(?x) {?x :p ?v } GROUP BY ?x
> Then
> A=AggregateJoin(Aggregation((?x), SAMPLE, {}, A))
> rewritten query:
> SELECT ?agg1 {?x :p ?v } GROUP BY ?x
> Finally:
> Project(A, {?agg1})
>
> Now you evaluate and get results where ?agg1 has some binding, but ?x
> is lost and Extend(...) is not used to extend the solutions so that ?x
> gets the values that ?agg1 has.
>
> Birte
>
>
>> This is the case where there is no AS because the variable is part of the
>> group key.
>>
>>        Andy
>>
>> On 17/11/11 09:00, Birte Glimm wrote:
>>>
>>> On 16 November 2011 19:08, Steve Harris<steve.harris@garlik.com>  wrote:
>>>>
>>>> On 2011-11-16, at 16:58, Birte Glimm wrote:
>>>>
>>>>> [snip]
>>>>>>>>
>>>>>>>> As syntax, (SAMPLE(?x) AS ?x) isn't legal because AS has to introduce
>>>>>>>> a new
>>>>>>>> variable. This happens in SELECT expression processing a few
>>>>>>>> subsections.
>>>>>>>
>>>>>>> Yes, that occured to me as well. Unless it is made legal for
>>>>>>> intermediate queries, which I don't like, there seems no way around
>>>>>>> creating solutions in the aggregate join that also contain the grouped
>>>>>>> variables.
>>>>>>
>>>>>> Yes, that why I ended up with the messy agg_i thing, to avoid
>>>>>> conflating aggregate results with variable names.
>>>>>>
>>>>>> Reading through this again, I think that the text as written is
>>>>>> correct:
>>>>>>
>>>>>>    For each variable V appearing outside of an aggregate
>>>>>>        Replace V with Sample(V) in Q
>>>>>>        End
>>>>>>
>>>>>> ensures that there's only aggregates being projected, then
>>>>>>
>>>>>>    For each aggregate X(args ; scalarvals) now in E
>>>>>>        # note scalarvals may be omitted, then it's equivalent to the
>>>>>> empty set
>>>>>>        Ai := Aggregation(args, X, scalarvals, G)
>>>>>>        Replace X(...) with aggi in Q
>>>>>>        i := i + 1
>>>>>>        End
>>>>>>
>>>>>> Defines A_i/agg_i for the Sample(V) above. I could well have spec
>>>>>> blindness though.
>>>>>
>>>>> That still does not solve the problem that you loose the original
>>>>> variable name, so results will contain agg_i and even worse, if you
>>>>> have a having clause, the variable used there might no longer exist
>>>>> since it was replaced by agg_i.
>>>>
>>>> I believe that's taken care of by AggregateJoin:
>>>>
>>>> Write A = (A1, A2, ...) where Ai = Aggregation(exprListi, funci,
>>>> scalarvarsi, P)
>>>>
>>>> eval(D(G), AggregateJoin(A)) = { (agg1, v1), ..., (aggn, vn) | vi such
>>>> that ( k, vi ) in eval(D(G), Ai)
>>>> for some k and each 1<= i<= n }
>>>>
>>>> vi is your var below, I think.
>>>
>>> I don't think. vi is the aggregated value, so you ( k, vi ) with k the
>>> key that was used to group the values and vi the aggregated value.
>>> I'll do one complete example to make clear where I think we have a
>>> problem:
>>>
>>> Assume data:
>>> ex:Birte ex:mark 4, 5 .
>>> ex:Steve ex:mark 3, 5 .
>>> and query:
>>> SELECT ?name (MAX(?mark) AS ?max) (AVG(?mark) AS ?avg)
>>> WHERE { ?name ex:mark ?mark } GROUP BY ?name
>>>
>>> Algebra translation up to group gives:
>>> A=Group((?name), Bgp(...))
>>> Replace ?name with SAMPLE(?name)
>>> SELECT SAMPLE(?name) (MAX(?mark) AS ?max) (AVG(?mark) AS ?avg)
>>> WHERE { ?name ex:mark ?mark } GROUP BY ?name
>>> continue transformation:
>>> A= AggregateJoin(
>>>   Aggregation((?name), SAMPLE, {}, A),
>>>   Aggregation((?mark), MAX, {}, A),
>>>   Aggregation((?mark), AVG, {}, A)
>>> )
>>> plus we rewrite the aggregates with aggi:
>>> SELECT ?agg1 (?agg2 AS ?max) (?agg3 AS ?avg)
>>> WHERE { ?name ex:mark ?mark } GROUP BY ?name
>>> We then translate the expressions:
>>> A=Extend(A, ?max, ?agg2)
>>> A=Extend(A, ?avg, ?agg3)
>>> Finally projection:
>>> Projcet(A, {?agg1, ?max, ?avg})
>>>
>>> Evaluation the BGP gives 4 solutions:
>>> mu1 : ?name ->  ex:Birte, ?mark ->  4
>>> mu2 : ?name ->  ex:Birte, ?mark ->  5
>>> mu3 : ?name ->  ex:Steve, ?mark ->  3
>>> mu4 : ?name ->  ex:Steve ?mark ->  5
>>> After grouping we get:
>>> { ex:Birte ->  {mu1, mu2}, ex:Steve ->  {mu3, mu4} }
>>> The three aggregates give
>>> SAMPLE: { (ex:Birte)->ex:Birte, (ex:Steve)->ex:Steve }
>>> MAX: { (ex:Birte)->5, (ex:Steve)->5 }
>>> AVG: { (ex:Birte)->4.5, (ex:Steve)->4 }
>>> Joining the aggregates gives 2 new solutions:
>>> mu_a: ?agg1->ex:Birte, ?agg2->5, ?agg3->4.5
>>> mu_b : ?agg1->ex:Steve, ?agg2->5, ?agg3->4
>>> We extend the solutions:
>>> mu_a' = mu_a union ?max->5, ?avg->4.5
>>> mu_b' = mu_a union ?max->5, ?avg->4
>>> Project then throws out ?agg2 and ?agg3 and we are done.
>>>
>>> We have, however, ?agg1 instead of ?name and if we had used HAVING
>>> with some condition on ?name we would have a problem.
>>>
>>> Do I do something wrong here or do we indeed have ?agg1 in the result
>>> and ?name got lost?
>>>
>>> Birte
>>>
>>>
>>>
>>>
>>>
>>>>
>>>> The way the document is structured moves these apart in an unfortunate
>>>> way.
>>>>
>>>> - Steve
>>>>
>>>>>
>>>>> How about having two loops
>>>>> For each aggregate (X(args ; scalarvals) AS var) now in E
>>>>>        # note scalarvals may be omitted, then it's equivalent to the
>>>>> empty set
>>>>>        Ai := Aggregation(args, X, scalarvals, G)
>>>>>        Replace X(...) with aggi in Q
>>>>>        i := i + 1
>>>>>        End
>>>>> For each aggregate X(args ; scalarvals) now in E
>>>>>        # note scalarvals may be omitted, then it's equivalent to the
>>>>> empty set
>>>>>        Ai := Aggregation(args, X, scalarvals, G)
>>>>>       Replace X(var; scalarvals) with (aggi AS var) in Q
>>>>>        i := i + 1
>>>>>        End
>>>>>
>>>>> This way, we never have an illegal syntax form, we guarantee that all
>>>>> variables are still available after the aggregation and since AS is
>>>>> only processed later all seems to be fine. One could of course think
>>>>> about handling both cases in one loop although for the spec having two
>>>>> loops seems fine to me.
>>>>>
>>>>> Birte
>>>>>
>>>>>> - Steve
>>>>>>
>>>>>>>> There is no definition of "Aggregation".  It's mentioned in 11.2 but
>>>>>>>> the
>>>>>>>> link goes to "Definition: Evaluation of Aggregation".  There should a
>>>>>>>> definition (just after group?) in 18.4.
>>>>>>>
>>>>>>> Yes, I also wondered about that. It is somehow clear how to evaluate,
>>>>>>> but it would be much more consistent if there were a definition.
>>>>>>>
>>>>>>>> I looked because I wondered if we could just have an "?x" as the
>>>>>>>> "aggregate".
>>>>>>>
>>>>>>> Not sure I understand this.
>>>>>>>
>>>>>>>> But I think, as Birte shows, as because it's done by syntactic
>>>>>>>> rewriting, just leaving it as "?x" would work.
>>>>>>>
>>>>>>> As I don't understand the sentence above. I just want to make my point
>>>>>>> again that we need a binding for ?x if ?x is grouped but not in an
>>>>>>> aggregate as it can be used in the HAVING clause. If, at the point of
>>>>>>> evaluating HAVING, we only have agg_1, we can't filter on ?x.
>>>>>>>
>>>>>>>>> I wanted to convert the plain ?x projection to an aggregate so it
>>>>>>>>> was
>>>>>>>>> consistent with the rest of the projections, but expressing it
>>>>>>>>> explicitly
>>>>>>>>> would be equivalent I think.
>>>>>>>>>
>>>>>>>>> I will have a run through the aggregation text and see if I can make
>>>>>>>>> that
>>>>>>>>> change with a relatively small change to the document.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>    Steve
>>>>>>>>
>>>>>>>> I also noticed;
>>>>>>>>
>>>>>>>> [[
>>>>>>>> Definition: Evaluation of AggregateJoin
>>>>>>>> ...
>>>>>>>> Note that if eval(D(G), Ai) is an error, it is ignored.
>>>>>>>> ]]
>>>>>>>>
>>>>>>>>  An error causes an error doesn't it?  (AS causes it to be unbound)
>>>>>>>
>>>>>>> AS is transformed into Extend(), which is evaluated:
>>>>>>> Extend(μ, var, expr) = μ ∪ { (var,value) | var not in dom(μ) and value
>>>>>>> = eval(expr) }
>>>>>>> Extend(μ, var, expr) = μ if var not in dom(μ) and eval(expr) is an
>>>>>>> error
>>>>>>>
>>>>>>> The latter makes the solution just not contain a mapping for the
>>>>>>> variable as I understand it.
>>>>>>>
>>>>>>> But while we are at it, there is a lowercase extend in the Definition
>>>>>>> of Extend:
>>>>>>> Extend(Ω , var, term) = { extend(μ, var, term) | μ in Ω }
>>>>>>>
>>>>>>> It is also lowercase in the evaluation semantics:
>>>>>>> Definition: Evaluation of Extend
>>>>>>> eval(D(G), extend(var, expr, P)) = extend(var, expr , eval(D(G), P))
>>>>>>> Furthermore, here we swap the order. It should be
>>>>>>> eval(D(G), Extend(P, var, expr)) = Extend(eval(D(G), P), var, expr)
>>>>>>> or the algorithm for translating queries into the algrebra is wrong
>>>>>>> and has to be changed.
>>>>>>>
>>>>>>> Birte
>>>>>>>>
>>>>>>>>        Andy
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Jun. Prof. Dr. Birte Glimm            Tel.:    +49 731 50 24125
>>>>>>> Inst. of Artificial Intelligence         Secr:  +49 731 50 24258
>>>>>>> University of Ulm                         Fax:   +49 731 50 24188
>>>>>>> D-89069 Ulm                               birte.glimm@uni-ulm.de
>>>>>>> Germany
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Steve Harris, CTO, Garlik Limited
>>>>>> 1-3 Halford Road, Richmond, TW10 6AW, UK
>>>>>> +44 20 8439 8203  http://www.garlik.com/
>>>>>> Registered in England and Wales 535 7233 VAT # 849 0517 11
>>>>>> Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jun. Prof. Dr. Birte Glimm            Tel.:    +49 731 50 24125
>>>>> Inst. of Artificial Intelligence         Secr:  +49 731 50 24258
>>>>> University of Ulm                         Fax:   +49 731 50 24188
>>>>> D-89069 Ulm                               birte.glimm@uni-ulm.de
>>>>> Germany
>>>>>
>>>>
>>>> --
>>>> Steve Harris, CTO, Garlik Limited
>>>> 1-3 Halford Road, Richmond, TW10 6AW, UK
>>>> +44 20 8439 8203  http://www.garlik.com/
>>>> Registered in England and Wales 535 7233 VAT # 849 0517 11
>>>>
>>>>
>>>
>>>
>>>
>>
>
>
>
> --
> Jun. Prof. Dr. Birte Glimm            Tel.:    +49 731 50 24125
> Inst. of Artificial Intelligence         Secr:  +49 731 50 24258
> University of Ulm                         Fax:   +49 731 50 24188
> D-89069 Ulm                               birte.glimm@uni-ulm.de
> Germany
>

--
Steve Harris, CTO, Garlik Limited
1-3 Halford Road, Richmond, TW10 6AW, UK
+44 20 8439 8203  http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11