Re: Order of evaluation for aggregates from Andy Seaborne on 2011-11-21 (public-rdf-dawg@w3.org from October to December 2011)

From: Andy Seaborne <andy.seaborne@epimorphics.com>
Date: Mon, 21 Nov 2011 14:28:56 +0000
To: birte.glimm@uni-ulm.de
CC: Steve Harris <steve.harris@garlik.com>, SPARQL Working Group <public-rdf-dawg@w3.org>
Message-ID: <4ECA6028.7030403@epimorphics.com>
Steve,

Could you walk through the translation of:

SELECT ?x {?x :p ?v } GROUP BY ?x

This is the case where there is no AS because the variable is part of 
the group key.

 Andy

On 17/11/11 09:00, Birte Glimm wrote:
> On 16 November 2011 19:08, Steve Harris<steve.harris@garlik.com>  wrote:
>> On 2011-11-16, at 16:58, Birte Glimm wrote:
>>
>>> [snip]
>>>>>> As syntax, (SAMPLE(?x) AS ?x) isn't legal because AS has to introduce a new
>>>>>> variable. This happens in SELECT expression processing a few subsections.
>>>>>
>>>>> Yes, that occured to me as well. Unless it is made legal for
>>>>> intermediate queries, which I don't like, there seems no way around
>>>>> creating solutions in the aggregate join that also contain the grouped
>>>>> variables.
>>>>
>>>> Yes, that why I ended up with the messy agg_i thing, to avoid conflating aggregate results with variable names.
>>>>
>>>> Reading through this again, I think that the text as written is correct:
>>>>
>>>>     For each variable V appearing outside of an aggregate
>>>>         Replace V with Sample(V) in Q
>>>>         End
>>>>
>>>> ensures that there's only aggregates being projected, then
>>>>
>>>>     For each aggregate X(args ; scalarvals) now in E
>>>>         # note scalarvals may be omitted, then it's equivalent to the empty set
>>>>         Ai := Aggregation(args, X, scalarvals, G)
>>>>         Replace X(...) with aggi in Q
>>>>         i := i + 1
>>>>         End
>>>>
>>>> Defines A_i/agg_i for the Sample(V) above. I could well have spec blindness though.
>>>
>>> That still does not solve the problem that you loose the original
>>> variable name, so results will contain agg_i and even worse, if you
>>> have a having clause, the variable used there might no longer exist
>>> since it was replaced by agg_i.
>>
>> I believe that's taken care of by AggregateJoin:
>>
>> Write A = (A1, A2, ...) where Ai = Aggregation(exprListi, funci, scalarvarsi, P)
>>
>> eval(D(G), AggregateJoin(A)) = { (agg1, v1), ..., (aggn, vn) | vi such that ( k, vi ) in eval(D(G), Ai)
>> for some k and each 1<= i<= n }
>>
>> vi is your var below, I think.
>
> I don't think. vi is the aggregated value, so you ( k, vi ) with k the
> key that was used to group the values and vi the aggregated value.
> I'll do one complete example to make clear where I think we have a
> problem:
>
> Assume data:
> ex:Birte ex:mark 4, 5 .
> ex:Steve ex:mark 3, 5 .
> and query:
> SELECT ?name (MAX(?mark) AS ?max) (AVG(?mark) AS ?avg)
> WHERE { ?name ex:mark ?mark } GROUP BY ?name
>
> Algebra translation up to group gives:
> A=Group((?name), Bgp(...))
> Replace ?name with SAMPLE(?name)
> SELECT SAMPLE(?name) (MAX(?mark) AS ?max) (AVG(?mark) AS ?avg)
> WHERE { ?name ex:mark ?mark } GROUP BY ?name
> continue transformation:
> A= AggregateJoin(
>    Aggregation((?name), SAMPLE, {}, A),
>    Aggregation((?mark), MAX, {}, A),
>    Aggregation((?mark), AVG, {}, A)
> )
> plus we rewrite the aggregates with aggi:
> SELECT ?agg1 (?agg2 AS ?max) (?agg3 AS ?avg)
> WHERE { ?name ex:mark ?mark } GROUP BY ?name
> We then translate the expressions:
> A=Extend(A, ?max, ?agg2)
> A=Extend(A, ?avg, ?agg3)
> Finally projection:
> Projcet(A, {?agg1, ?max, ?avg})
>
> Evaluation the BGP gives 4 solutions:
> mu1 : ?name ->  ex:Birte, ?mark ->  4
> mu2 : ?name ->  ex:Birte, ?mark ->  5
> mu3 : ?name ->  ex:Steve, ?mark ->  3
> mu4 : ?name ->  ex:Steve ?mark ->  5
> After grouping we get:
> { ex:Birte ->  {mu1, mu2}, ex:Steve ->  {mu3, mu4} }
> The three aggregates give
> SAMPLE: { (ex:Birte)->ex:Birte, (ex:Steve)->ex:Steve }
> MAX: { (ex:Birte)->5, (ex:Steve)->5 }
> AVG: { (ex:Birte)->4.5, (ex:Steve)->4 }
> Joining the aggregates gives 2 new solutions:
> mu_a: ?agg1->ex:Birte, ?agg2->5, ?agg3->4.5
> mu_b : ?agg1->ex:Steve, ?agg2->5, ?agg3->4
> We extend the solutions:
> mu_a' = mu_a union ?max->5, ?avg->4.5
> mu_b' = mu_a union ?max->5, ?avg->4
> Project then throws out ?agg2 and ?agg3 and we are done.
>
> We have, however, ?agg1 instead of ?name and if we had used HAVING
> with some condition on ?name we would have a problem.
>
> Do I do something wrong here or do we indeed have ?agg1 in the result
> and ?name got lost?
>
> Birte
>
>
>
>
>
>>
>> The way the document is structured moves these apart in an unfortunate way.
>>
>> - Steve
>>
>>>
>>> How about having two loops
>>> For each aggregate (X(args ; scalarvals) AS var) now in E
>>>         # note scalarvals may be omitted, then it's equivalent to the empty set
>>>         Ai := Aggregation(args, X, scalarvals, G)
>>>         Replace X(...) with aggi in Q
>>>         i := i + 1
>>>         End
>>> For each aggregate X(args ; scalarvals) now in E
>>>         # note scalarvals may be omitted, then it's equivalent to the empty set
>>>         Ai := Aggregation(args, X, scalarvals, G)
>>>        Replace X(var; scalarvals) with (aggi AS var) in Q
>>>         i := i + 1
>>>         End
>>>
>>> This way, we never have an illegal syntax form, we guarantee that all
>>> variables are still available after the aggregation and since AS is
>>> only processed later all seems to be fine. One could of course think
>>> about handling both cases in one loop although for the spec having two
>>> loops seems fine to me.
>>>
>>> Birte
>>>
>>>> - Steve
>>>>
>>>>>> There is no definition of "Aggregation".  It's mentioned in 11.2 but the
>>>>>> link goes to "Definition: Evaluation of Aggregation".  There should a
>>>>>> definition (just after group?) in 18.4.
>>>>>
>>>>> Yes, I also wondered about that. It is somehow clear how to evaluate,
>>>>> but it would be much more consistent if there were a definition.
>>>>>
>>>>>> I looked because I wondered if we could just have an "?x" as the
>>>>>> "aggregate".
>>>>>
>>>>> Not sure I understand this.
>>>>>
>>>>>> But I think, as Birte shows, as because it's done by syntactic
>>>>>> rewriting, just leaving it as "?x" would work.
>>>>>
>>>>> As I don't understand the sentence above. I just want to make my point
>>>>> again that we need a binding for ?x if ?x is grouped but not in an
>>>>> aggregate as it can be used in the HAVING clause. If, at the point of
>>>>> evaluating HAVING, we only have agg_1, we can't filter on ?x.
>>>>>
>>>>>>> I wanted to convert the plain ?x projection to an aggregate so it was
>>>>>>> consistent with the rest of the projections, but expressing it explicitly
>>>>>>> would be equivalent I think.
>>>>>>>
>>>>>>> I will have a run through the aggregation text and see if I can make that
>>>>>>> change with a relatively small change to the document.
>>>>>>>
>>>>>>> Cheers,
>>>>>>>     Steve
>>>>>>
>>>>>> I also noticed;
>>>>>>
>>>>>> [[
>>>>>> Definition: Evaluation of AggregateJoin
>>>>>> ...
>>>>>> Note that if eval(D(G), Ai) is an error, it is ignored.
>>>>>> ]]
>>>>>>
>>>>>>   An error causes an error doesn't it?  (AS causes it to be unbound)
>>>>>
>>>>> AS is transformed into Extend(), which is evaluated:
>>>>> Extend(μ, var, expr) = μ ∪ { (var,value) | var not in dom(μ) and value
>>>>> = eval(expr) }
>>>>> Extend(μ, var, expr) = μ if var not in dom(μ) and eval(expr) is an error
>>>>>
>>>>> The latter makes the solution just not contain a mapping for the
>>>>> variable as I understand it.
>>>>>
>>>>> But while we are at it, there is a lowercase extend in the Definition of Extend:
>>>>> Extend(Ω , var, term) = { extend(μ, var, term) | μ in Ω }
>>>>>
>>>>> It is also lowercase in the evaluation semantics:
>>>>> Definition: Evaluation of Extend
>>>>> eval(D(G), extend(var, expr, P)) = extend(var, expr , eval(D(G), P))
>>>>> Furthermore, here we swap the order. It should be
>>>>> eval(D(G), Extend(P, var, expr)) = Extend(eval(D(G), P), var, expr)
>>>>> or the algorithm for translating queries into the algrebra is wrong
>>>>> and has to be changed.
>>>>>
>>>>> Birte
>>>>>>         Andy
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jun. Prof. Dr. Birte Glimm            Tel.:    +49 731 50 24125
>>>>> Inst. of Artificial Intelligence         Secr:  +49 731 50 24258
>>>>> University of Ulm                         Fax:   +49 731 50 24188
>>>>> D-89069 Ulm                               birte.glimm@uni-ulm.de
>>>>> Germany
>>>>>
>>>>
>>>> --
>>>> Steve Harris, CTO, Garlik Limited
>>>> 1-3 Halford Road, Richmond, TW10 6AW, UK
>>>> +44 20 8439 8203  http://www.garlik.com/
>>>> Registered in England and Wales 535 7233 VAT # 849 0517 11
>>>> Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Jun. Prof. Dr. Birte Glimm            Tel.:    +49 731 50 24125
>>> Inst. of Artificial Intelligence         Secr:  +49 731 50 24258
>>> University of Ulm                         Fax:   +49 731 50 24188
>>> D-89069 Ulm                               birte.glimm@uni-ulm.de
>>> Germany
>>>
>>
>> --
>> Steve Harris, CTO, Garlik Limited
>> 1-3 Halford Road, Richmond, TW10 6AW, UK
>> +44 20 8439 8203  http://www.garlik.com/
>> Registered in England and Wales 535 7233 VAT # 849 0517 11
>> Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
>>
>>
>
>
>
Received on Monday, 21 November 2011 14:29:40 UTC