Re: Order of evaluation for aggregates from Birte Glimm on 2011-11-17 (public-rdf-dawg@w3.org from October to December 2011)

From: Birte Glimm <birte.glimm@uni-ulm.de>
Date: Thu, 17 Nov 2011 10:00:49 +0100
To: Steve Harris <steve.harris@garlik.com>
Cc: Andy Seaborne <andy.seaborne@epimorphics.com>, SPARQL Working Group <public-rdf-dawg@w3.org>
Message-ID: <CABt65OeRpRAhkyHk0CKyU6etxfz=XvivQ_qXTZKejzODdv1POQ@mail.gmail.com>
On 16 November 2011 19:08, Steve Harris <steve.harris@garlik.com> wrote:
> On 2011-11-16, at 16:58, Birte Glimm wrote:
>
>> [snip]
>>>>> As syntax, (SAMPLE(?x) AS ?x) isn't legal because AS has to introduce a new
>>>>> variable. This happens in SELECT expression processing a few subsections.
>>>>
>>>> Yes, that occured to me as well. Unless it is made legal for
>>>> intermediate queries, which I don't like, there seems no way around
>>>> creating solutions in the aggregate join that also contain the grouped
>>>> variables.
>>>
>>> Yes, that why I ended up with the messy agg_i thing, to avoid conflating aggregate results with variable names.
>>>
>>> Reading through this again, I think that the text as written is correct:
>>>
>>>    For each variable V appearing outside of an aggregate
>>>        Replace V with Sample(V) in Q
>>>        End
>>>
>>> ensures that there's only aggregates being projected, then
>>>
>>>    For each aggregate X(args ; scalarvals) now in E
>>>        # note scalarvals may be omitted, then it's equivalent to the empty set
>>>        Ai := Aggregation(args, X, scalarvals, G)
>>>        Replace X(...) with aggi in Q
>>>        i := i + 1
>>>        End
>>>
>>> Defines A_i/agg_i for the Sample(V) above. I could well have spec blindness though.
>>
>> That still does not solve the problem that you loose the original
>> variable name, so results will contain agg_i and even worse, if you
>> have a having clause, the variable used there might no longer exist
>> since it was replaced by agg_i.
>
> I believe that's taken care of by AggregateJoin:
>
> Write A = (A1, A2, ...) where Ai = Aggregation(exprListi, funci, scalarvarsi, P)
>
> eval(D(G), AggregateJoin(A)) = { (agg1, v1), ..., (aggn, vn) | vi such that ( k, vi ) in eval(D(G), Ai)
> for some k and each 1 <= i <= n }
>
> vi is your var below, I think.

I don't think. vi is the aggregated value, so you ( k, vi ) with k the
key that was used to group the values and vi the aggregated value.
I'll do one complete example to make clear where I think we have a
problem:

Assume data:
ex:Birte ex:mark 4, 5 .
ex:Steve ex:mark 3, 5 .
and query:
SELECT ?name (MAX(?mark) AS ?max) (AVG(?mark) AS ?avg)
WHERE { ?name ex:mark ?mark } GROUP BY ?name

Algebra translation up to group gives:
A=Group((?name), Bgp(...))
Replace ?name with SAMPLE(?name)
SELECT SAMPLE(?name) (MAX(?mark) AS ?max) (AVG(?mark) AS ?avg)
WHERE { ?name ex:mark ?mark } GROUP BY ?name
continue transformation:
A= AggregateJoin(
  Aggregation((?name), SAMPLE, {}, A),
  Aggregation((?mark), MAX, {}, A),
  Aggregation((?mark), AVG, {}, A)
)
plus we rewrite the aggregates with aggi:
SELECT ?agg1 (?agg2 AS ?max) (?agg3 AS ?avg)
WHERE { ?name ex:mark ?mark } GROUP BY ?name
We then translate the expressions:
A=Extend(A, ?max, ?agg2)
A=Extend(A, ?avg, ?agg3)
Finally projection:
Projcet(A, {?agg1, ?max, ?avg})

Evaluation the BGP gives 4 solutions:
mu1 : ?name -> ex:Birte, ?mark -> 4
mu2 : ?name -> ex:Birte, ?mark -> 5
mu3 : ?name -> ex:Steve, ?mark -> 3
mu4 : ?name -> ex:Steve ?mark -> 5
After grouping we get:
{ ex:Birte -> {mu1, mu2}, ex:Steve -> {mu3, mu4} }
The three aggregates give
SAMPLE: { (ex:Birte)->ex:Birte, (ex:Steve)->ex:Steve }
MAX: { (ex:Birte)->5, (ex:Steve)->5 }
AVG: { (ex:Birte)->4.5, (ex:Steve)->4 }
Joining the aggregates gives 2 new solutions:
mu_a: ?agg1->ex:Birte, ?agg2->5, ?agg3->4.5
mu_b : ?agg1->ex:Steve, ?agg2->5, ?agg3->4
We extend the solutions:
mu_a' = mu_a union ?max->5, ?avg->4.5
mu_b' = mu_a union ?max->5, ?avg->4
Project then throws out ?agg2 and ?agg3 and we are done.

We have, however, ?agg1 instead of ?name and if we had used HAVING
with some condition on ?name we would have a problem.

Do I do something wrong here or do we indeed have ?agg1 in the result
and ?name got lost?

Birte





>
> The way the document is structured moves these apart in an unfortunate way.
>
> - Steve
>
>>
>> How about having two loops
>> For each aggregate (X(args ; scalarvals) AS var) now in E
>>        # note scalarvals may be omitted, then it's equivalent to the empty set
>>        Ai := Aggregation(args, X, scalarvals, G)
>>        Replace X(...) with aggi in Q
>>        i := i + 1
>>        End
>> For each aggregate X(args ; scalarvals) now in E
>>        # note scalarvals may be omitted, then it's equivalent to the empty set
>>        Ai := Aggregation(args, X, scalarvals, G)
>>       Replace X(var; scalarvals) with (aggi AS var) in Q
>>        i := i + 1
>>        End
>>
>> This way, we never have an illegal syntax form, we guarantee that all
>> variables are still available after the aggregation and since AS is
>> only processed later all seems to be fine. One could of course think
>> about handling both cases in one loop although for the spec having two
>> loops seems fine to me.
>>
>> Birte
>>
>>> - Steve
>>>
>>>>> There is no definition of "Aggregation".  It's mentioned in 11.2 but the
>>>>> link goes to "Definition: Evaluation of Aggregation".  There should a
>>>>> definition (just after group?) in 18.4.
>>>>
>>>> Yes, I also wondered about that. It is somehow clear how to evaluate,
>>>> but it would be much more consistent if there were a definition.
>>>>
>>>>> I looked because I wondered if we could just have an "?x" as the
>>>>> "aggregate".
>>>>
>>>> Not sure I understand this.
>>>>
>>>>> But I think, as Birte shows, as because it's done by syntactic
>>>>> rewriting, just leaving it as "?x" would work.
>>>>
>>>> As I don't understand the sentence above. I just want to make my point
>>>> again that we need a binding for ?x if ?x is grouped but not in an
>>>> aggregate as it can be used in the HAVING clause. If, at the point of
>>>> evaluating HAVING, we only have agg_1, we can't filter on ?x.
>>>>
>>>>>> I wanted to convert the plain ?x projection to an aggregate so it was
>>>>>> consistent with the rest of the projections, but expressing it explicitly
>>>>>> would be equivalent I think.
>>>>>>
>>>>>> I will have a run through the aggregation text and see if I can make that
>>>>>> change with a relatively small change to the document.
>>>>>>
>>>>>> Cheers,
>>>>>>    Steve
>>>>>
>>>>> I also noticed;
>>>>>
>>>>> [[
>>>>> Definition: Evaluation of AggregateJoin
>>>>> ...
>>>>> Note that if eval(D(G), Ai) is an error, it is ignored.
>>>>> ]]
>>>>>
>>>>>  An error causes an error doesn't it?  (AS causes it to be unbound)
>>>>
>>>> AS is transformed into Extend(), which is evaluated:
>>>> Extend(μ, var, expr) = μ ∪ { (var,value) | var not in dom(μ) and value
>>>> = eval(expr) }
>>>> Extend(μ, var, expr) = μ if var not in dom(μ) and eval(expr) is an error
>>>>
>>>> The latter makes the solution just not contain a mapping for the
>>>> variable as I understand it.
>>>>
>>>> But while we are at it, there is a lowercase extend in the Definition of Extend:
>>>> Extend(Ω , var, term) = { extend(μ, var, term) | μ in Ω }
>>>>
>>>> It is also lowercase in the evaluation semantics:
>>>> Definition: Evaluation of Extend
>>>> eval(D(G), extend(var, expr, P)) = extend(var, expr , eval(D(G), P))
>>>> Furthermore, here we swap the order. It should be
>>>> eval(D(G), Extend(P, var, expr)) = Extend(eval(D(G), P), var, expr)
>>>> or the algorithm for translating queries into the algrebra is wrong
>>>> and has to be changed.
>>>>
>>>> Birte
>>>>>        Andy
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Jun. Prof. Dr. Birte Glimm            Tel.:    +49 731 50 24125
>>>> Inst. of Artificial Intelligence         Secr:  +49 731 50 24258
>>>> University of Ulm                         Fax:   +49 731 50 24188
>>>> D-89069 Ulm                               birte.glimm@uni-ulm.de
>>>> Germany
>>>>
>>>
>>> --
>>> Steve Harris, CTO, Garlik Limited
>>> 1-3 Halford Road, Richmond, TW10 6AW, UK
>>> +44 20 8439 8203  http://www.garlik.com/
>>> Registered in England and Wales 535 7233 VAT # 849 0517 11
>>> Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
>>>
>>>
>>
>>
>>
>> --
>> Jun. Prof. Dr. Birte Glimm            Tel.:    +49 731 50 24125
>> Inst. of Artificial Intelligence         Secr:  +49 731 50 24258
>> University of Ulm                         Fax:   +49 731 50 24188
>> D-89069 Ulm                               birte.glimm@uni-ulm.de
>> Germany
>>
>
> --
> Steve Harris, CTO, Garlik Limited
> 1-3 Halford Road, Richmond, TW10 6AW, UK
> +44 20 8439 8203  http://www.garlik.com/
> Registered in England and Wales 535 7233 VAT # 849 0517 11
> Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
>
>



-- 
Jun. Prof. Dr. Birte Glimm            Tel.:    +49 731 50 24125
Inst. of Artificial Intelligence         Secr:  +49 731 50 24258
University of Ulm                         Fax:   +49 731 50 24188
D-89069 Ulm                               birte.glimm@uni-ulm.de
Germany
Received on Thursday, 17 November 2011 09:01:24 UTC