Re: Order of evaluation for aggregates from Birte Glimm on 2011-11-21 (public-rdf-dawg@w3.org from October to December 2011)

From: Birte Glimm <birte.glimm@uni-ulm.de>
Date: Mon, 21 Nov 2011 22:18:42 +0100
To: Steve Harris <steve.harris@garlik.com>
Cc: Andy Seaborne <andy.seaborne@epimorphics.com>, SPARQL Working Group <public-rdf-dawg@w3.org>
Message-ID: <CABt65OcZF3GH2Lyq6PX54KMMi0sgzSd=kboPzKadD-CLoNSLYA@mail.gmail.com>
On 21 November 2011 16:08, Steve Harris <steve.harris@garlik.com> wrote:
> Yeah, I'm trying to swap this stuff in. I believe Birte has demonstrated that it doesn't join up, I just need to find a solution that doesn't violate any of the other rules.

Can I repeat my latest fix/suggestion? I don't see that it invalidates
any other constraints.

>>>> How about having two loops
>>>> For each aggregate (X(args ; scalarvals) AS var) now in E
>>>>        # note scalarvals may be omitted, then it's equivalent to the empty set
>>>>        Ai := Aggregation(args, X, scalarvals, G)
>>>>        Replace X(...) with aggi in Q
>>>>        i := i + 1
>>>>        End
>>>> For each aggregate X(args ; scalarvals) now in E
>>>>        # note scalarvals may be omitted, then it's equivalent to the empty set
>>>>        Ai := Aggregation(args, X, scalarvals, G)
>>>>        Replace X(var; scalarvals) with (aggi AS var) in Q
>>>>        i := i + 1
>>>>        End
>>>>
>>>> This way, we never have an illegal syntax form, we guarantee that all
>>>> variables are still available after the aggregation and since AS is
>>>> only processed later all seems to be fine. One could of course think
>>>> about handling both cases in one loop although for the spec having two
>>>> loops seems fine to me.
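
To make the two loops concrete, here is a rough Python sketch. The tuple encoding of the SELECT items and the name rewrite_aggregates are only my illustration and do not come from the spec text; scalarvals are left out (treated as the empty set), and the second loop assumes the bare aggregate has a single variable as its argument, as in Sample(V).

```python
# Hypothetical encoding of SELECT items (an assumption, not spec text):
#   ("agg", func, arg)          a bare aggregate, e.g. SAMPLE(?name)
#   ("agg_as", func, arg, var)  an aliased aggregate, e.g. (MAX(?mark) AS ?max)
#   ("var_as", agg_var, var)    a rewritten item, i.e. (?agg_i AS var)

def rewrite_aggregates(items):
    """Apply the two loops and return (aggregations, rewritten items)."""
    aggregations = []  # A_1, A_2, ... recorded as (func, arg) pairs
    out = list(items)

    # Loop 1: (X(args) AS var) becomes (agg_i AS var).
    for pos, item in enumerate(out):
        if item[0] == "agg_as":
            _, func, arg, var = item
            aggregations.append((func, arg))
            out[pos] = ("var_as", f"?agg{len(aggregations)}", var)

    # Loop 2: a bare X(var) becomes (agg_i AS var), so var stays available.
    for pos, item in enumerate(out):
        if item[0] == "agg":
            _, func, arg = item
            aggregations.append((func, arg))
            out[pos] = ("var_as", f"?agg{len(aggregations)}", arg)

    return aggregations, out


# SELECT SAMPLE(?name) (MAX(?mark) AS ?max) (AVG(?mark) AS ?avg)
items = [
    ("agg", "SAMPLE", "?name"),
    ("agg_as", "MAX", "?mark", "?max"),
    ("agg_as", "AVG", "?mark", "?avg"),
]
aggregations, rewritten = rewrite_aggregates(items)
```

In this sketch the numbering follows loop order, so SAMPLE(?name) ends up as ?agg3 rather than ?agg1; the point is only that every projected item has the form (agg_i AS var) and ?name is still bound afterwards.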

Birte

> - Steve
>
> On 2011-11-21, at 14:28, Andy Seaborne wrote:
>
>> Steve,
>>
>> Could you walk through the translation of:
>>
>> SELECT ?x {?x :p ?v } GROUP BY ?x
>>
>> This is the case where there is no AS because the variable is part of the group key.
>>
>>       Andy
>>
>> On 17/11/11 09:00, Birte Glimm wrote:
>>> On 16 November 2011 19:08, Steve Harris<steve.harris@garlik.com>  wrote:
>>>> On 2011-11-16, at 16:58, Birte Glimm wrote:
>>>>
>>>>> [snip]
>>>>>>>> As syntax, (SAMPLE(?x) AS ?x) isn't legal because AS has to introduce a new
>>>>>>>> variable. This happens in SELECT expression processing a few subsections.
>>>>>>>
>>>>>>> Yes, that occurred to me as well. Unless it is made legal for
>>>>>>> intermediate queries, which I don't like, there seems no way around
>>>>>>> creating solutions in the aggregate join that also contain the grouped
>>>>>>> variables.
>>>>>>
>>>>>> Yes, that's why I ended up with the messy agg_i thing, to avoid conflating aggregate results with variable names.
>>>>>>
>>>>>> Reading through this again, I think that the text as written is correct:
>>>>>>
>>>>>>    For each variable V appearing outside of an aggregate
>>>>>>        Replace V with Sample(V) in Q
>>>>>>        End
>>>>>>
>>>>>> ensures that only aggregates are being projected, then
>>>>>>
>>>>>>    For each aggregate X(args ; scalarvals) now in E
>>>>>>        # note scalarvals may be omitted, then it's equivalent to the empty set
>>>>>>        Ai := Aggregation(args, X, scalarvals, G)
>>>>>>        Replace X(...) with aggi in Q
>>>>>>        i := i + 1
>>>>>>        End
>>>>>>
>>>>>> Defines A_i/agg_i for the Sample(V) above. I could well have spec blindness though.
>>>>>
>>>>> That still does not solve the problem that you lose the original
>>>>> variable name, so results will contain agg_i and even worse, if you
>>>>> have a having clause, the variable used there might no longer exist
>>>>> since it was replaced by agg_i.
>>>>
>>>> I believe that's taken care of by AggregateJoin:
>>>>
>>>> Write A = (A1, A2, ...) where Ai = Aggregation(exprListi, funci, scalarvarsi, P)
>>>>
>>>> eval(D(G), AggregateJoin(A)) = { (agg1, v1), ..., (aggn, vn) | vi such that ( k, vi ) in eval(D(G), Ai)
>>>> for some k and each 1<= i<= n }
>>>>
>>>> vi is your var below, I think.
>>>
>>> I don't think so. vi is the aggregated value, so you have ( k, vi ) with k
>>> the key that was used to group the values and vi the aggregated value.
>>> I'll do one complete example to make clear where I think we have a
>>> problem:
>>>
>>> Assume data:
>>> ex:Birte ex:mark 4, 5 .
>>> ex:Steve ex:mark 3, 5 .
>>> and query:
>>> SELECT ?name (MAX(?mark) AS ?max) (AVG(?mark) AS ?avg)
>>> WHERE { ?name ex:mark ?mark } GROUP BY ?name
>>>
>>> Algebra translation up to group gives:
>>> A=Group((?name), Bgp(...))
>>> Replace ?name with SAMPLE(?name)
>>> SELECT SAMPLE(?name) (MAX(?mark) AS ?max) (AVG(?mark) AS ?avg)
>>> WHERE { ?name ex:mark ?mark } GROUP BY ?name
>>> continue transformation:
>>> A= AggregateJoin(
>>>   Aggregation((?name), SAMPLE, {}, A),
>>>   Aggregation((?mark), MAX, {}, A),
>>>   Aggregation((?mark), AVG, {}, A)
>>> )
>>> plus we rewrite the aggregates with aggi:
>>> SELECT ?agg1 (?agg2 AS ?max) (?agg3 AS ?avg)
>>> WHERE { ?name ex:mark ?mark } GROUP BY ?name
>>> We then translate the expressions:
>>> A=Extend(A, ?max, ?agg2)
>>> A=Extend(A, ?avg, ?agg3)
>>> Finally projection:
>>> Project(A, {?agg1, ?max, ?avg})
>>>
>>> Evaluating the BGP gives 4 solutions:
>>> mu1 : ?name ->  ex:Birte, ?mark ->  4
>>> mu2 : ?name ->  ex:Birte, ?mark ->  5
>>> mu3 : ?name ->  ex:Steve, ?mark ->  3
>>> mu4 : ?name ->  ex:Steve, ?mark ->  5
>>> After grouping we get:
>>> { ex:Birte ->  {mu1, mu2}, ex:Steve ->  {mu3, mu4} }
>>> The three aggregates give
>>> SAMPLE: { (ex:Birte)->ex:Birte, (ex:Steve)->ex:Steve }
>>> MAX: { (ex:Birte)->5, (ex:Steve)->5 }
>>> AVG: { (ex:Birte)->4.5, (ex:Steve)->4 }
>>> Joining the aggregates gives 2 new solutions:
>>> mu_a: ?agg1->ex:Birte, ?agg2->5, ?agg3->4.5
>>> mu_b : ?agg1->ex:Steve, ?agg2->5, ?agg3->4
>>> We extend the solutions:
>>> mu_a' = mu_a union ?max->5, ?avg->4.5
>>> mu_b' = mu_b union ?max->5, ?avg->4
>>> Project then throws out ?agg2 and ?agg3 and we are done.
>>>
>>> We have, however, ?agg1 instead of ?name and if we had used HAVING
>>> with some condition on ?name we would have a problem.
>>>
>>> Do I do something wrong here or do we indeed have ?agg1 in the result
>>> and ?name got lost?
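
The walkthrough above can be replayed with a small Python sketch. The helper names (group, aggregation, aggregate_join, extend, project) and the dictionary encoding of solutions are assumptions made for illustration; they only mimic the algebra operators informally, and SAMPLE is modelled as "take the first value".

```python
from collections import defaultdict

# The four BGP solutions mu1..mu4 from the example data.
solutions = [
    {"?name": "ex:Birte", "?mark": 4},
    {"?name": "ex:Birte", "?mark": 5},
    {"?name": "ex:Steve", "?mark": 3},
    {"?name": "ex:Steve", "?mark": 5},
]

def group(sols, key_var):
    """Partition solutions by the value of the grouping variable."""
    groups = defaultdict(list)
    for mu in sols:
        groups[mu[key_var]].append(mu)
    return groups

def aggregation(groups, var, func):
    """Map each group key to the aggregated value of var in that group."""
    return {key: func([mu[var] for mu in members])
            for key, members in groups.items()}

def aggregate_join(aggs):
    """Build one solution per key, binding only ?agg1..?aggn."""
    return [{f"?agg{i}": agg[key] for i, agg in enumerate(aggs, start=1)}
            for key in aggs[0]]

def extend(sols, var, source):
    """Bind var to the value of an already bound variable."""
    return [{**mu, var: mu[source]} for mu in sols]

def project(sols, variables):
    """Keep only the listed variables."""
    return [{v: mu[v] for v in variables} for mu in sols]

groups = group(solutions, "?name")
joined = aggregate_join([
    aggregation(groups, "?name", lambda vals: vals[0]),                # SAMPLE
    aggregation(groups, "?mark", max),                                 # MAX
    aggregation(groups, "?mark", lambda vals: sum(vals) / len(vals)),  # AVG
])
extended = extend(extend(joined, "?max", "?agg2"), "?avg", "?agg3")
result = project(extended, ["?agg1", "?max", "?avg"])
```

Under these assumptions no solution in result has a binding for ?name, only for ?agg1, which is exactly the problem for a HAVING condition on ?name.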
>>>
>>> Birte
>>>
>>>
>>>>
>>>> The way the document is structured moves these apart in an unfortunate way.
>>>>
>>>> - Steve
>>>>
>>>>>
>>>>> How about having two loops
>>>>> For each aggregate (X(args ; scalarvals) AS var) now in E
>>>>>        # note scalarvals may be omitted, then it's equivalent to the empty set
>>>>>        Ai := Aggregation(args, X, scalarvals, G)
>>>>>        Replace X(...) with aggi in Q
>>>>>        i := i + 1
>>>>>        End
>>>>> For each aggregate X(args ; scalarvals) now in E
>>>>>        # note scalarvals may be omitted, then it's equivalent to the empty set
>>>>>        Ai := Aggregation(args, X, scalarvals, G)
>>>>>        Replace X(var; scalarvals) with (aggi AS var) in Q
>>>>>        i := i + 1
>>>>>        End
>>>>>
>>>>> This way, we never have an illegal syntax form, we guarantee that all
>>>>> variables are still available after the aggregation and since AS is
>>>>> only processed later all seems to be fine. One could of course think
>>>>> about handling both cases in one loop although for the spec having two
>>>>> loops seems fine to me.
>>>>>
>>>>> Birte
>>>>>
>>>>>> - Steve
>>>>>>
>>>>>>>> There is no definition of "Aggregation".  It's mentioned in 11.2 but the
>>>>>>>> link goes to "Definition: Evaluation of Aggregation".  There should be a
>>>>>>>> definition (just after group?) in 18.4.
>>>>>>>
>>>>>>> Yes, I also wondered about that. It is somehow clear how to evaluate,
>>>>>>> but it would be much more consistent if there were a definition.
>>>>>>>
>>>>>>>> I looked because I wondered if we could just have an "?x" as the
>>>>>>>> "aggregate".
>>>>>>>
>>>>>>> Not sure I understand this.
>>>>>>>
>>>>>>>> But I think, as Birte shows, because it's done by syntactic
>>>>>>>> rewriting, just leaving it as "?x" would work.
>>>>>>>
>>>>>>> As I don't understand the sentence above, I just want to make my point
>>>>>>> again that we need a binding for ?x if ?x is grouped but not in an
>>>>>>> aggregate as it can be used in the HAVING clause. If, at the point of
>>>>>>> evaluating HAVING, we only have agg_1, we can't filter on ?x.
>>>>>>>
>>>>>>>>> I wanted to convert the plain ?x projection to an aggregate so it was
>>>>>>>>> consistent with the rest of the projections, but expressing it explicitly
>>>>>>>>> would be equivalent I think.
>>>>>>>>>
>>>>>>>>> I will have a run through the aggregation text and see if I can make that
>>>>>>>>> change with a relatively small change to the document.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>    Steve
>>>>>>>>
>>>>>>>> I also noticed;
>>>>>>>>
>>>>>>>> [[
>>>>>>>> Definition: Evaluation of AggregateJoin
>>>>>>>> ...
>>>>>>>> Note that if eval(D(G), Ai) is an error, it is ignored.
>>>>>>>> ]]
>>>>>>>>
>>>>>>>>  An error causes an error doesn't it?  (AS causes it to be unbound)
>>>>>>>
>>>>>>> AS is transformed into Extend(), which is evaluated:
>>>>>>> Extend(μ, var, expr) = μ ∪ { (var,value) | var not in dom(μ) and value
>>>>>>> = eval(expr) }
>>>>>>> Extend(μ, var, expr) = μ if var not in dom(μ) and eval(expr) is an error
>>>>>>>
>>>>>>> The latter makes the solution just not contain a mapping for the
>>>>>>> variable as I understand it.
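
A small Python sketch of the two Extend cases quoted above. ExprError, extend, extend_all and half_of_agg1 are hypothetical names used only for illustration; an expression error is modelled as a Python exception, and raising ValueError when var is already bound is my own choice, since the quoted definition does not cover that case.

```python
class ExprError(Exception):
    """Hypothetical stand-in for an expression evaluation error."""

def extend(mu, var, expr):
    """Extend one solution mapping mu with var bound to expr(mu)."""
    if var in mu:
        # Not covered by the two quoted cases; rejected here by assumption.
        raise ValueError(f"{var} is already bound")
    try:
        value = expr(mu)
    except ExprError:
        # Error case: the solution is kept, but var stays unbound.
        return dict(mu)
    return {**mu, var: value}

def extend_all(omega, var, expr):
    """Extend(Omega, var, expr): apply extend to every solution in Omega."""
    return [extend(mu, var, expr) for mu in omega]

def half_of_agg1(mu):
    """Example expression that fails when ?agg1 is unbound."""
    if "?agg1" not in mu:
        raise ExprError("?agg1 is unbound")
    return mu["?agg1"] / 2

omega = [{"?agg1": 4}, {"?other": 1}]
extended = extend_all(omega, "?x", half_of_agg1)
```

Here the first solution gains a binding for ?x, while the second is passed through unchanged, which matches the reading that an error simply leaves the variable unbound.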
>>>>>>>
>>>>>>> But while we are at it, there is a lowercase extend in the Definition of Extend:
>>>>>>> Extend(Ω , var, term) = { extend(μ, var, term) | μ in Ω }
>>>>>>>
>>>>>>> It is also lowercase in the evaluation semantics:
>>>>>>> Definition: Evaluation of Extend
>>>>>>> eval(D(G), extend(var, expr, P)) = extend(var, expr , eval(D(G), P))
>>>>>>> Furthermore, here we swap the order. It should be
>>>>>>> eval(D(G), Extend(P, var, expr)) = Extend(eval(D(G), P), var, expr)
>>>>>>> or the algorithm for translating queries into the algebra is wrong
>>>>>>> and has to be changed.
>>>>>>>
>>>>>>> Birte
>>>>>>>>        Andy
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Jun. Prof. Dr. Birte Glimm            Tel.:    +49 731 50 24125
>>>>>>> Inst. of Artificial Intelligence         Secr:  +49 731 50 24258
>>>>>>> University of Ulm                         Fax:   +49 731 50 24188
>>>>>>> D-89069 Ulm                               birte.glimm@uni-ulm.de
>>>>>>> Germany
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Steve Harris, CTO, Garlik Limited
>>>>>> 1-3 Halford Road, Richmond, TW10 6AW, UK
>>>>>> +44 20 8439 8203  http://www.garlik.com/
>>>>>> Registered in England and Wales 535 7233 VAT # 849 0517 11
>>>>>> Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
>>>>>>



-- 
Jun. Prof. Dr. Birte Glimm            Tel.:    +49 731 50 24125
Inst. of Artificial Intelligence         Secr:  +49 731 50 24258
University of Ulm                         Fax:   +49 731 50 24188
D-89069 Ulm                               birte.glimm@uni-ulm.de
Germany
Received on Monday, 21 November 2011 21:19:18 UTC