Re: Order of evaluation for aggregates from Steve Harris on 2011-11-22 (public-rdf-dawg@w3.org from October to December 2011)

From: Steve Harris <steve.harris@garlik.com>
Date: Tue, 22 Nov 2011 12:50:00 +0000
To: birte.glimm@uni-ulm.de
Cc: Andy Seaborne <andy.seaborne@epimorphics.com>, SPARQL Working Group <public-rdf-dawg@w3.org>
Message-Id: <88FCC533-EF4D-4CAF-8945-4256EF8FDDCD@garlik.com>
OK, I think I have a solution, which bridges the gap, and doesn't violate any scope rules:

For each expression E in SELECT and each HAVING(E) in Q

    If E does not contain an aggregate
        Replace E with Sample(E) in Q
        End

    For each variable V appearing outside of an aggregate
        Ai := Aggregation(V, Sample, {}, G)
        Replace V with aggi in E
 Q := Extend(Q, V, aggi)
        i := i + 1
        End

    For each aggregate X(args ; scalarvals) now in E
        # note scalarvals may be omitted, then it's equivalent to the empty set
        Ai := Aggregation(args, X, scalarvals, G)
        Replace X(...) with aggi in Q
        i := i + 1
        End

    End

So, SELECT ?x {?x :p ?v } GROUP BY ?x

Becomes

A=AggregateJoin(Aggregation((?x), SAMPLE, {}, A))

rewritten query:
SELECT ?agg1 {?x :p ?v } GROUP BY ?x

The the result is
Extend(Project(A, {?agg1}), ?x, ?agg1)

- Steve

On 2011-11-21, at 21:26, Birte Glimm wrote:

> On 21 November 2011 15:28, Andy Seaborne <andy.seaborne@epimorphics.com> wrote:
>> Steve,
>> 
>> Could you walk through the translation of:
>> 
>> SELECT ?x {?x :p ?v } GROUP BY ?x
> 
> Up to aggregation you get:
> A=Group((?x), Bgp(?x :p ?v))
> rewritten query:
> SELECT SAMPLE(?x) {?x :p ?v } GROUP BY ?x
> Then
> A=AggregateJoin(Aggregation((?x), SAMPLE, {}, A))
> rewritten query:
> SELECT ?agg1 {?x :p ?v } GROUP BY ?x
> Finally:
> Project(A, {?agg1})
> 
> Now you evaluate and get results where ?agg1 has some binding, but ?x
> is lost and Extend(...) is not used to extend the solutions so that ?x
> gets the values that ?agg1 has.
> 
> Birte
> 
> 
>> This is the case where there is no AS because the variable is part of the
>> group key.
>> 
>>        Andy
>> 
>> On 17/11/11 09:00, Birte Glimm wrote:
>>> 
>>> On 16 November 2011 19:08, Steve Harris<steve.harris@garlik.com>  wrote:
>>>> 
>>>> On 2011-11-16, at 16:58, Birte Glimm wrote:
>>>> 
>>>>> [snip]
>>>>>>>> 
>>>>>>>> As syntax, (SAMPLE(?x) AS ?x) isn't legal because AS has to introduce
>>>>>>>> a new
>>>>>>>> variable. This happens in SELECT expression processing a few
>>>>>>>> subsections.
>>>>>>> 
>>>>>>> Yes, that occured to me as well. Unless it is made legal for
>>>>>>> intermediate queries, which I don't like, there seems no way around
>>>>>>> creating solutions in the aggregate join that also contain the grouped
>>>>>>> variables.
>>>>>> 
>>>>>> Yes, that why I ended up with the messy agg_i thing, to avoid
>>>>>> conflating aggregate results with variable names.
>>>>>> 
>>>>>> Reading through this again, I think that the text as written is
>>>>>> correct:
>>>>>> 
>>>>>>    For each variable V appearing outside of an aggregate
>>>>>>        Replace V with Sample(V) in Q
>>>>>>        End
>>>>>> 
>>>>>> ensures that there's only aggregates being projected, then
>>>>>> 
>>>>>>    For each aggregate X(args ; scalarvals) now in E
>>>>>>        # note scalarvals may be omitted, then it's equivalent to the
>>>>>> empty set
>>>>>>        Ai := Aggregation(args, X, scalarvals, G)
>>>>>>        Replace X(...) with aggi in Q
>>>>>>        i := i + 1
>>>>>>        End
>>>>>> 
>>>>>> Defines A_i/agg_i for the Sample(V) above. I could well have spec
>>>>>> blindness though.
>>>>> 
>>>>> That still does not solve the problem that you loose the original
>>>>> variable name, so results will contain agg_i and even worse, if you
>>>>> have a having clause, the variable used there might no longer exist
>>>>> since it was replaced by agg_i.
>>>> 
>>>> I believe that's taken care of by AggregateJoin:
>>>> 
>>>> Write A = (A1, A2, ...) where Ai = Aggregation(exprListi, funci,
>>>> scalarvarsi, P)
>>>> 
>>>> eval(D(G), AggregateJoin(A)) = { (agg1, v1), ..., (aggn, vn) | vi such
>>>> that ( k, vi ) in eval(D(G), Ai)
>>>> for some k and each 1<= i<= n }
>>>> 
>>>> vi is your var below, I think.
>>> 
>>> I don't think. vi is the aggregated value, so you ( k, vi ) with k the
>>> key that was used to group the values and vi the aggregated value.
>>> I'll do one complete example to make clear where I think we have a
>>> problem:
>>> 
>>> Assume data:
>>> ex:Birte ex:mark 4, 5 .
>>> ex:Steve ex:mark 3, 5 .
>>> and query:
>>> SELECT ?name (MAX(?mark) AS ?max) (AVG(?mark) AS ?avg)
>>> WHERE { ?name ex:mark ?mark } GROUP BY ?name
>>> 
>>> Algebra translation up to group gives:
>>> A=Group((?name), Bgp(...))
>>> Replace ?name with SAMPLE(?name)
>>> SELECT SAMPLE(?name) (MAX(?mark) AS ?max) (AVG(?mark) AS ?avg)
>>> WHERE { ?name ex:mark ?mark } GROUP BY ?name
>>> continue transformation:
>>> A= AggregateJoin(
>>>   Aggregation((?name), SAMPLE, {}, A),
>>>   Aggregation((?mark), MAX, {}, A),
>>>   Aggregation((?mark), AVG, {}, A)
>>> )
>>> plus we rewrite the aggregates with aggi:
>>> SELECT ?agg1 (?agg2 AS ?max) (?agg3 AS ?avg)
>>> WHERE { ?name ex:mark ?mark } GROUP BY ?name
>>> We then translate the expressions:
>>> A=Extend(A, ?max, ?agg2)
>>> A=Extend(A, ?avg, ?agg3)
>>> Finally projection:
>>> Projcet(A, {?agg1, ?max, ?avg})
>>> 
>>> Evaluation the BGP gives 4 solutions:
>>> mu1 : ?name ->  ex:Birte, ?mark ->  4
>>> mu2 : ?name ->  ex:Birte, ?mark ->  5
>>> mu3 : ?name ->  ex:Steve, ?mark ->  3
>>> mu4 : ?name ->  ex:Steve ?mark ->  5
>>> After grouping we get:
>>> { ex:Birte ->  {mu1, mu2}, ex:Steve ->  {mu3, mu4} }
>>> The three aggregates give
>>> SAMPLE: { (ex:Birte)->ex:Birte, (ex:Steve)->ex:Steve }
>>> MAX: { (ex:Birte)->5, (ex:Steve)->5 }
>>> AVG: { (ex:Birte)->4.5, (ex:Steve)->4 }
>>> Joining the aggregates gives 2 new solutions:
>>> mu_a: ?agg1->ex:Birte, ?agg2->5, ?agg3->4.5
>>> mu_b : ?agg1->ex:Steve, ?agg2->5, ?agg3->4
>>> We extend the solutions:
>>> mu_a' = mu_a union ?max->5, ?avg->4.5
>>> mu_b' = mu_a union ?max->5, ?avg->4
>>> Project then throws out ?agg2 and ?agg3 and we are done.
>>> 
>>> We have, however, ?agg1 instead of ?name and if we had used HAVING
>>> with some condition on ?name we would have a problem.
>>> 
>>> Do I do something wrong here or do we indeed have ?agg1 in the result
>>> and ?name got lost?
>>> 
>>> Birte
>>> 
>>> 
>>> 
>>> 
>>> 
>>>> 
>>>> The way the document is structured moves these apart in an unfortunate
>>>> way.
>>>> 
>>>> - Steve
>>>> 
>>>>> 
>>>>> How about having two loops
>>>>> For each aggregate (X(args ; scalarvals) AS var) now in E
>>>>>        # note scalarvals may be omitted, then it's equivalent to the
>>>>> empty set
>>>>>        Ai := Aggregation(args, X, scalarvals, G)
>>>>>        Replace X(...) with aggi in Q
>>>>>        i := i + 1
>>>>>        End
>>>>> For each aggregate X(args ; scalarvals) now in E
>>>>>        # note scalarvals may be omitted, then it's equivalent to the
>>>>> empty set
>>>>>        Ai := Aggregation(args, X, scalarvals, G)
>>>>>       Replace X(var; scalarvals) with (aggi AS var) in Q
>>>>>        i := i + 1
>>>>>        End
>>>>> 
>>>>> This way, we never have an illegal syntax form, we guarantee that all
>>>>> variables are still available after the aggregation and since AS is
>>>>> only processed later all seems to be fine. One could of course think
>>>>> about handling both cases in one loop although for the spec having two
>>>>> loops seems fine to me.
>>>>> 
>>>>> Birte
>>>>> 
>>>>>> - Steve
>>>>>> 
>>>>>>>> There is no definition of "Aggregation".  It's mentioned in 11.2 but
>>>>>>>> the
>>>>>>>> link goes to "Definition: Evaluation of Aggregation".  There should a
>>>>>>>> definition (just after group?) in 18.4.
>>>>>>> 
>>>>>>> Yes, I also wondered about that. It is somehow clear how to evaluate,
>>>>>>> but it would be much more consistent if there were a definition.
>>>>>>> 
>>>>>>>> I looked because I wondered if we could just have an "?x" as the
>>>>>>>> "aggregate".
>>>>>>> 
>>>>>>> Not sure I understand this.
>>>>>>> 
>>>>>>>> But I think, as Birte shows, as because it's done by syntactic
>>>>>>>> rewriting, just leaving it as "?x" would work.
>>>>>>> 
>>>>>>> As I don't understand the sentence above. I just want to make my point
>>>>>>> again that we need a binding for ?x if ?x is grouped but not in an
>>>>>>> aggregate as it can be used in the HAVING clause. If, at the point of
>>>>>>> evaluating HAVING, we only have agg_1, we can't filter on ?x.
>>>>>>> 
>>>>>>>>> I wanted to convert the plain ?x projection to an aggregate so it
>>>>>>>>> was
>>>>>>>>> consistent with the rest of the projections, but expressing it
>>>>>>>>> explicitly
>>>>>>>>> would be equivalent I think.
>>>>>>>>> 
>>>>>>>>> I will have a run through the aggregation text and see if I can make
>>>>>>>>> that
>>>>>>>>> change with a relatively small change to the document.
>>>>>>>>> 
>>>>>>>>> Cheers,
>>>>>>>>>    Steve
>>>>>>>> 
>>>>>>>> I also noticed;
>>>>>>>> 
>>>>>>>> [[
>>>>>>>> Definition: Evaluation of AggregateJoin
>>>>>>>> ...
>>>>>>>> Note that if eval(D(G), Ai) is an error, it is ignored.
>>>>>>>> ]]
>>>>>>>> 
>>>>>>>>  An error causes an error doesn't it?  (AS causes it to be unbound)
>>>>>>> 
>>>>>>> AS is transformed into Extend(), which is evaluated:
>>>>>>> Extend(μ, var, expr) = μ ∪ { (var,value) | var not in dom(μ) and value
>>>>>>> = eval(expr) }
>>>>>>> Extend(μ, var, expr) = μ if var not in dom(μ) and eval(expr) is an
>>>>>>> error
>>>>>>> 
>>>>>>> The latter makes the solution just not contain a mapping for the
>>>>>>> variable as I understand it.
>>>>>>> 
>>>>>>> But while we are at it, there is a lowercase extend in the Definition
>>>>>>> of Extend:
>>>>>>> Extend(Ω , var, term) = { extend(μ, var, term) | μ in Ω }
>>>>>>> 
>>>>>>> It is also lowercase in the evaluation semantics:
>>>>>>> Definition: Evaluation of Extend
>>>>>>> eval(D(G), extend(var, expr, P)) = extend(var, expr , eval(D(G), P))
>>>>>>> Furthermore, here we swap the order. It should be
>>>>>>> eval(D(G), Extend(P, var, expr)) = Extend(eval(D(G), P), var, expr)
>>>>>>> or the algorithm for translating queries into the algrebra is wrong
>>>>>>> and has to be changed.
>>>>>>> 
>>>>>>> Birte
>>>>>>>> 
>>>>>>>>        Andy
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> Jun. Prof. Dr. Birte Glimm            Tel.:    +49 731 50 24125
>>>>>>> Inst. of Artificial Intelligence         Secr:  +49 731 50 24258
>>>>>>> University of Ulm                         Fax:   +49 731 50 24188
>>>>>>> D-89069 Ulm                               birte.glimm@uni-ulm.de
>>>>>>> Germany
>>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Steve Harris, CTO, Garlik Limited
>>>>>> 1-3 Halford Road, Richmond, TW10 6AW, UK
>>>>>> +44 20 8439 8203  http://www.garlik.com/
>>>>>> Registered in England and Wales 535 7233 VAT # 849 0517 11
>>>>>> Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10
>>>>>> 9AD
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Jun. Prof. Dr. Birte Glimm            Tel.:    +49 731 50 24125
>>>>> Inst. of Artificial Intelligence         Secr:  +49 731 50 24258
>>>>> University of Ulm                         Fax:   +49 731 50 24188
>>>>> D-89069 Ulm                               birte.glimm@uni-ulm.de
>>>>> Germany
>>>>> 
>>>> 
>>>> --
>>>> Steve Harris, CTO, Garlik Limited
>>>> 1-3 Halford Road, Richmond, TW10 6AW, UK
>>>> +44 20 8439 8203  http://www.garlik.com/
>>>> Registered in England and Wales 535 7233 VAT # 849 0517 11
>>>> Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
>>>> 
>>>> 
>>> 
>>> 
>>> 
>> 
> 
> 
> 
> -- 
> Jun. Prof. Dr. Birte Glimm            Tel.:    +49 731 50 24125
> Inst. of Artificial Intelligence         Secr:  +49 731 50 24258
> University of Ulm                         Fax:   +49 731 50 24188
> D-89069 Ulm                               birte.glimm@uni-ulm.de
> Germany
> 

-- 
Steve Harris, CTO, Garlik Limited
1-3 Halford Road, Richmond, TW10 6AW, UK
+44 20 8439 8203  http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
Received on Tuesday, 22 November 2011 12:50:50 UTC