Re: Order of evaluation for aggregates from Steve Harris on 2011-11-21 (public-rdf-dawg@w3.org from October to December 2011)

From: Steve Harris <steve.harris@garlik.com>
Date: Mon, 21 Nov 2011 15:08:18 +0000
To: Andy Seaborne <andy.seaborne@epimorphics.com>
Cc: birte.glimm@uni-ulm.de, SPARQL Working Group <public-rdf-dawg@w3.org>
Message-Id: <1A3A8C1B-C464-4118-A479-8E129F14D387@garlik.com>
Yeah, I'm trying to swap this stuff in. I believe Birte has demonstrated that it doesn't join up; I just need to find a solution that doesn't violate any of the other rules.

- Steve

On 2011-11-21, at 14:28, Andy Seaborne wrote:

> Steve,
> 
> Could you walk through the translation of:
> 
> SELECT ?x {?x :p ?v } GROUP BY ?x
> 
> This is the case where there is no AS because the variable is part of the group key.
> 
>  Andy
> 
> On 17/11/11 09:00, Birte Glimm wrote:
>> On 16 November 2011 19:08, Steve Harris <steve.harris@garlik.com> wrote:
>>> On 2011-11-16, at 16:58, Birte Glimm wrote:
>>> 
>>>> [snip]
>>>>>>> As syntax, (SAMPLE(?x) AS ?x) isn't legal because AS has to introduce a new
>>>>>>> variable. This happens in SELECT expression processing a few subsections.
>>>>>> 
>>>>>> Yes, that occurred to me as well. Unless it is made legal for
>>>>>> intermediate queries, which I don't like, there seems no way around
>>>>>> creating solutions in the aggregate join that also contain the grouped
>>>>>> variables.
>>>>> 
>>>>> Yes, that's why I ended up with the messy agg_i thing, to avoid conflating aggregate results with variable names.
>>>>> 
>>>>> Reading through this again, I think that the text as written is correct:
>>>>> 
>>>>>    For each variable V appearing outside of an aggregate
>>>>>        Replace V with Sample(V) in Q
>>>>>        End
>>>>> 
>>>>> ensures that there's only aggregates being projected, then
>>>>> 
>>>>>    For each aggregate X(args ; scalarvals) now in E
>>>>>        # note scalarvals may be omitted, then it's equivalent to the empty set
>>>>>        Ai := Aggregation(args, X, scalarvals, G)
>>>>>        Replace X(...) with aggi in Q
>>>>>        i := i + 1
>>>>>        End
>>>>> 
>>>>> Defines A_i/agg_i for the Sample(V) above. I could well have spec blindness though.
>>>> 
>>>> That still does not solve the problem that you lose the original
>>>> variable name, so results will contain agg_i and even worse, if you
>>>> have a having clause, the variable used there might no longer exist
>>>> since it was replaced by agg_i.
>>> 
>>> I believe that's taken care of by AggregateJoin:
>>> 
>>> Write A = (A1, A2, ...) where Ai = Aggregation(exprListi, funci, scalarvarsi, P)
>>> 
>>> eval(D(G), AggregateJoin(A)) = { (agg1, v1), ..., (aggn, vn) | vi such that ( k, vi ) in eval(D(G), Ai)
>>> for some k and each 1<= i<= n }
>>> 
>>> vi is your var below, I think.
>> 
>> I don't think so. vi is the aggregated value, so you have ( k, vi ) with k the
>> key that was used to group the values and vi the aggregated value.
>> I'll do one complete example to make clear where I think we have a
>> problem:
>> 
>> Assume data:
>> ex:Birte ex:mark 4, 5 .
>> ex:Steve ex:mark 3, 5 .
>> and query:
>> SELECT ?name (MAX(?mark) AS ?max) (AVG(?mark) AS ?avg)
>> WHERE { ?name ex:mark ?mark } GROUP BY ?name
>> 
>> Algebra translation up to group gives:
>> A=Group((?name), Bgp(...))
>> Replace ?name with SAMPLE(?name)
>> SELECT SAMPLE(?name) (MAX(?mark) AS ?max) (AVG(?mark) AS ?avg)
>> WHERE { ?name ex:mark ?mark } GROUP BY ?name
>> continue transformation:
>> A= AggregateJoin(
>>   Aggregation((?name), SAMPLE, {}, A),
>>   Aggregation((?mark), MAX, {}, A),
>>   Aggregation((?mark), AVG, {}, A)
>> )
>> plus we rewrite the aggregates with aggi:
>> SELECT ?agg1 (?agg2 AS ?max) (?agg3 AS ?avg)
>> WHERE { ?name ex:mark ?mark } GROUP BY ?name
>> We then translate the expressions:
>> A=Extend(A, ?max, ?agg2)
>> A=Extend(A, ?avg, ?agg3)
>> Finally projection:
>> Project(A, {?agg1, ?max, ?avg})
>> 
>> Evaluating the BGP gives 4 solutions:
>> mu1 : ?name -> ex:Birte, ?mark -> 4
>> mu2 : ?name -> ex:Birte, ?mark -> 5
>> mu3 : ?name -> ex:Steve, ?mark -> 3
>> mu4 : ?name -> ex:Steve, ?mark -> 5
>> After grouping we get:
>> { ex:Birte ->  {mu1, mu2}, ex:Steve ->  {mu3, mu4} }
>> The three aggregates give
>> SAMPLE: { (ex:Birte)->ex:Birte, (ex:Steve)->ex:Steve }
>> MAX: { (ex:Birte)->5, (ex:Steve)->5 }
>> AVG: { (ex:Birte)->4.5, (ex:Steve)->4 }
>> Joining the aggregates gives 2 new solutions:
>> mu_a: ?agg1->ex:Birte, ?agg2->5, ?agg3->4.5
>> mu_b : ?agg1->ex:Steve, ?agg2->5, ?agg3->4
>> We extend the solutions:
>> mu_a' = mu_a union ?max->5, ?avg->4.5
>> mu_b' = mu_b union ?max->5, ?avg->4
>> Project then throws out ?agg2 and ?agg3 and we are done.
>> 
>> We have, however, ?agg1 instead of ?name and if we had used HAVING
>> with some condition on ?name we would have a problem.
>> 
>> Do I do something wrong here or do we indeed have ?agg1 in the result
>> and ?name got lost?
>> 
>> Birte
>> 
>>> 
>>> The way the document is structured moves these apart in an unfortunate way.
>>> 
>>> - Steve
>>> 
>>>> 
>>>> How about having two loops
>>>> For each aggregate (X(args ; scalarvals) AS var) now in E
>>>>        # note scalarvals may be omitted, then it's equivalent to the empty set
>>>>        Ai := Aggregation(args, X, scalarvals, G)
>>>>        Replace X(...) with aggi in Q
>>>>        i := i + 1
>>>>        End
>>>> For each aggregate X(args ; scalarvals) now in E
>>>>        # note scalarvals may be omitted, then it's equivalent to the empty set
>>>>        Ai := Aggregation(args, X, scalarvals, G)
>>>>       Replace X(var; scalarvals) with (aggi AS var) in Q
>>>>        i := i + 1
>>>>        End
>>>> 
>>>> This way, we never have an illegal syntax form, we guarantee that all
>>>> variables are still available after the aggregation and since AS is
>>>> only processed later all seems to be fine. One could of course think
>>>> about handling both cases in one loop although for the spec having two
>>>> loops seems fine to me.
>>>> 
>>>> Birte
>>>> 
>>>>> - Steve
>>>>> 
>>>>>>> There is no definition of "Aggregation".  It's mentioned in 11.2 but the
>>>>>>> link goes to "Definition: Evaluation of Aggregation".  There should be a
>>>>>>> definition (just after group?) in 18.4.
>>>>>> 
>>>>>> Yes, I also wondered about that. It is somehow clear how to evaluate,
>>>>>> but it would be much more consistent if there were a definition.
>>>>>> 
>>>>>>> I looked because I wondered if we could just have an "?x" as the
>>>>>>> "aggregate".
>>>>>> 
>>>>>> Not sure I understand this.
>>>>>> 
>>>>>>> But I think, as Birte shows, because it's done by syntactic
>>>>>>> rewriting, just leaving it as "?x" would work.
>>>>>> 
>>>>>> As I don't understand the sentence above, I just want to make my point
>>>>>> again that we need a binding for ?x if ?x is grouped but not in an
>>>>>> aggregate as it can be used in the HAVING clause. If, at the point of
>>>>>> evaluating HAVING, we only have agg_1, we can't filter on ?x.
>>>>>> 
>>>>>>>> I wanted to convert the plain ?x projection to an aggregate so it was
>>>>>>>> consistent with the rest of the projections, but expressing it explicitly
>>>>>>>> would be equivalent I think.
>>>>>>>> 
>>>>>>>> I will have a run through the aggregation text and see if I can make that
>>>>>>>> change with a relatively small change to the document.
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>>    Steve
>>>>>>> 
>>>>>>> I also noticed:
>>>>>>> 
>>>>>>> [[
>>>>>>> Definition: Evaluation of AggregateJoin
>>>>>>> ...
>>>>>>> Note that if eval(D(G), Ai) is an error, it is ignored.
>>>>>>> ]]
>>>>>>> 
>>>>>>>  An error causes an error, doesn't it?  (AS causes it to be unbound)
>>>>>> 
>>>>>> AS is transformed into Extend(), which is evaluated:
>>>>>> Extend(μ, var, expr) = μ ∪ { (var,value) | var not in dom(μ) and value
>>>>>> = eval(expr) }
>>>>>> Extend(μ, var, expr) = μ if var not in dom(μ) and eval(expr) is an error
>>>>>> 
>>>>>> The latter makes the solution just not contain a mapping for the
>>>>>> variable as I understand it.
>>>>>> 
>>>>>> But while we are at it, there is a lowercase extend in the Definition of Extend:
>>>>>> Extend(Ω , var, term) = { extend(μ, var, term) | μ in Ω }
>>>>>> 
>>>>>> It is also lowercase in the evaluation semantics:
>>>>>> Definition: Evaluation of Extend
>>>>>> eval(D(G), extend(var, expr, P)) = extend(var, expr , eval(D(G), P))
>>>>>> Furthermore, here we swap the order. It should be
>>>>>> eval(D(G), Extend(P, var, expr)) = Extend(eval(D(G), P), var, expr)
>>>>>> or the algorithm for translating queries into the algebra is wrong
>>>>>> and has to be changed.
>>>>>> 
>>>>>> Birte
>>>>>>>        Andy
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Jun. Prof. Dr. Birte Glimm            Tel.:    +49 731 50 24125
>>>>>> Inst. of Artificial Intelligence         Secr:  +49 731 50 24258
>>>>>> University of Ulm                         Fax:   +49 731 50 24188
>>>>>> D-89069 Ulm                               birte.glimm@uni-ulm.de
>>>>>> Germany
>>>>>> 
>>>>> 
>>>>> --
>>>>> Steve Harris, CTO, Garlik Limited
>>>>> 1-3 Halford Road, Richmond, TW10 6AW, UK
>>>>> +44 20 8439 8203  http://www.garlik.com/
>>>>> Registered in England and Wales 535 7233 VAT # 849 0517 11
>>>>> Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
>>>>> 
>>>>> 
>>>> 
>>> 
>> 
>> 
>> 
> 
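For anyone who wants to replay Birte's worked example mechanically, here is a rough Python sketch of the steps she lists (Group, three Aggregations, AggregateJoin, two Extends, Project). It is only an illustration under assumptions: solutions are plain dicts, SAMPLE is taken to return the first value in the group, and all the helper names (group, aggregation, aggregate_join, extend, project) are made up for this sketch rather than taken from the spec text.

```python
from collections import defaultdict

# Solutions of { ?name ex:mark ?mark } over the example data (mu1..mu4).
bgp = [
    {"name": "ex:Birte", "mark": 4},
    {"name": "ex:Birte", "mark": 5},
    {"name": "ex:Steve", "mark": 3},
    {"name": "ex:Steve", "mark": 5},
]

def group(keys, solutions):
    """Map each group key tuple to the list of solutions that share it."""
    groups = defaultdict(list)
    for mu in solutions:
        groups[tuple(mu[k] for k in keys)].append(mu)
    return dict(groups)

def aggregation(var, func, groups):
    """Map each group key to one aggregated value, i.e. the (k, vi) pairs."""
    return {key: func([mu[var] for mu in mus]) for key, mus in groups.items()}

def aggregate_join(aggregations):
    """One solution per key, binding agg1..aggn to the aggregated values."""
    return [
        {f"agg{i}": agg[key] for i, agg in enumerate(aggregations, start=1)}
        for key in aggregations[0]
    ]

def extend(solutions, var, source):
    """Simplified Extend: bind var to the value already bound to source."""
    return [{**mu, var: mu[source]} for mu in solutions]

def project(solutions, variables):
    """Keep only the listed variables in each solution."""
    return [{v: mu[v] for v in variables if v in mu} for mu in solutions]

def sample(values):
    # Assumption: SAMPLE may return any group member; this picks the first.
    return values[0]

def average(values):
    return sum(values) / len(values)

groups = group(["name"], bgp)
joined = aggregate_join([
    aggregation("name", sample, groups),   # agg1
    aggregation("mark", max, groups),      # agg2
    aggregation("mark", average, groups),  # agg3
])
extended = extend(extend(joined, "max", "agg2"), "avg", "agg3")
result = project(extended, ["agg1", "max", "avg"])
print(result)
```

In this sketch the projected solutions carry agg1, max and avg but no binding for name, which is the gap Birte describes: a HAVING condition on ?name would have nothing to test against.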

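On the Extend point raised further down the thread (an error in the expression leaves the variable unbound instead of failing the whole solution), a small Python sketch of that reading may help. It is hedged in several ways: solutions are dicts, the expression is a Python callable, a Python exception stands in for a SPARQL evaluation error, and raising when var is already bound is my assumption, since the quoted definition only covers the case where var is not in dom(mu). The names extend and extend_all are illustrative only.

```python
def extend(mu, var, expr):
    """Extend a single solution mu with var bound to the value of expr."""
    if var in mu:
        # Assumption: the quoted definition says nothing about this case.
        raise ValueError(f"{var} is already bound in this solution")
    try:
        value = expr(mu)
    except Exception:
        # Error while evaluating expr: return mu unchanged, var stays unbound.
        return dict(mu)
    return {**mu, var: value}

def extend_all(omega, var, expr):
    """Extend applied to every solution in a multiset of solutions."""
    return [extend(mu, var, expr) for mu in omega]

omega = [{"agg2": 5}, {"agg2": "not a number"}]
halved = extend_all(omega, "half", lambda mu: mu["agg2"] / 2)
print(halved)
```

The second solution survives without a binding for half, which matches Birte's description that the solution "just does not contain a mapping for the variable".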
-- 
Steve Harris, CTO, Garlik Limited
1-3 Halford Road, Richmond, TW10 6AW, UK
+44 20 8439 8203  http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
Received on Monday, 21 November 2011 15:09:08 UTC