# Re: Aggregates

From: Steve Harris <steve.harris@garlik.com>
Date: Fri, 9 Dec 2011 11:52:08 +0000

Message-Id: <A4B8FA7C-3968-4CB7-8061-AED1E81D4ED5@garlik.com>
To: Andy Seaborne <andy.seaborne@epimorphics.com>
```On 2011-12-09, at 10:53, Andy Seaborne wrote:

>
>
> On 09/12/11 10:26, Steve Harris wrote:
>> On 2011-12-08, at 09:02, Andy Seaborne wrote:
>>
>>>
>>>
>>> On 07/12/11 13:57, Steve Harris wrote:
>>>> On 2011-12-07, at 13:00, Andy Seaborne wrote:
>>>>
>>>>>
>>>>>
>>>>> On 06/12/11 22:40, Steve Harris wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> I've now got the aggregates in a state where I think all the information is carried through from one end of the query to the other… but I've thought that before :)
>>>>>>
>>>>>> I also think ORDER BY is covered.
>>>>>>
>>>>>> Here's a sketch of what I think should be happening:
>>>>>>
>>>>>> Data
>>>>>>
>>>>>> <a>    <p>    1 .
>>>>>> <a>    <p>    2 .
>>>>>> <b>    <p>    3 .
>>>>>>
>>>>>> Query
>>>>>>
>>>>>> SELECT (MAX(?o) AS ?max) (MIN(?o) AS ?min)
>>>>>> WHERE { ?s ?p ?o }
>>>>>> GROUP BY ?s
>>>>>> ORDER BY AVG(?o)
>>>>>>
>>>>>>
>>>>>> Ω = Sol  ?s   ?p   ?o
>>>>>>     μ1<a>     <p>     1
>>>>>>     μ2<a>     <p>     2
>>>>>>     μ3<b>     <p>     3
>>>>>>
>>>>>> G = Group((?s), Ω)
>>>>>>   = { ((<a>), { μ1, μ2 }), ((<b>), { μ3 }) }
>>>>>>
>>>>>> Q = SELECT agg1 agg2
>>>>>>     WHERE { ?s ?p ?o }
>>>>>>     GROUP BY ?s
>>>>>>     ORDER BY agg3
>>>>>>
>>>>>> E = { (?max, agg1), (?min, agg2) }
>>>>>>
>>>>>> A1 = Aggregation((?o), Max, {}, G)
>>>>>> A2 = Aggregation((?o), Min, {}, G)
>>>>>> A3 = Aggregation((?o), Avg, {}, G)
>>>>>>
>>>>>> J = AggregateJoin(A) =
>>>>>>   { { (agg1, 2), (agg2, 1), (agg3, 1.5) }
>>>>>>     { (agg1, 3), (agg2, 3), (agg3, 3) } }
>>>>>
>>>>>
>>>>> This is the evaluation of AggregateJoin at execution time.
>>>>>
>>>>> I don't understand this step: how does it know the variables are agg1, agg2, and agg3? There could be other agg_i from other query levels. And why this order not agg3, agg2, agg1?
>>>>
>>>> From the A, A has members 1, 2, and 3 in this case. A1 pairs with agg1 for e.g.
>>>>
>>>> If it were a lower query level it might have members 4, 5, and 6 for e.g.
>>>
>>> Not quite: i is reset on every SELECT processed
>>> """
>>>  # Note, i is global for the query, defaults to 1
>>>  Let i := 1
>>> """
>>>
>>> The comment might have that intent, but, to me, "Let" introduces a variable each time.
>>
>> Well, I can remove the word Let if that helps.
>
> Rewording that would be a great help. (I'm not sure what you intend by "default" ) Why not
>
> global i := 1   # Initially 1 for each query processed

That works for me, fixed. I resisted the urge to use the word Static ;)

>>> It is workable as a definition but rather unclear to me.  The fact it works relies on scoping features of variables so that the use of agg_1 twice does not fall apart.
>>
>> But, that's true of all the pseudocode in the document.
>
> Not at all.  Other forms generated pass the variables around e.g. SELECT expressions generate "Extend(?, var, expr)" forms so the variable is there.
>
>>> Also, the variable names are regenerated.  How does AggregateJoin know the variable is called "agg1" not "__gen1" because the query really does use ?agg1 in teh user written part?
>>
>> Well, it's explicit in the text - I don't really understand how this is an issue - users have no way to create a variable called agg<sub>1</sub>, so their can't be any conflict.
>
> This worries me.  Why can't a user accidentally write the name of an aggregate?

Not having a ? prefix, and subscript–1 isn't legal in a variable name, if I understand it correctly.

> Either (1) the aggregate code generates a variable named in a way that query syntax can't create or (2) the agg variable needs to be named based on the rest of the query.
>
> If it's (2), then how does the eval of AggregateJoin know that choice?

It's (1). If I'm following the algebra correctly then the use can't cause something like Extend(X, agg<sub>i</sub>, y) to occur, so it can't collide.

If that's too subtle, or wrong, maybe I should add a prefix or suffix?

If I'm following the grammar correctly variable names can't include a ' character for e.g., I could add that?

>>> I think it would have been easier to do it all in the translation and not have AggregateJoin which is really just a form of "extend" assigning Ai to agg_i.
>>
>> It's more complex than that - it has to collapse the groups into a solution multiset in order to fit with the rest of the algebra.
>
> Could you please point to the text for that because as written currently its the Aggregate function, each one, not AggregateJoin, that does the collapse

Ah, indeed, my mistake.

So it could be something like:

E_0 = {}
E_i = Extend(E_n-1, agg_i, Aggregation(exprList_i, func_i, scalarvars_i, P))
for each 1 <= i <= n

That certainly looks simpler - if it's legitimate.

- Steve

> -------
> Definition: AggregateJoin
>
> AggregateJoin, a function which takes a set of aggregations, and returns a multiset of solutions.
> -------
> Definition: Evaluation of Aggregation
> eval(D(G), Aggregation(exprlist, func, scalarvals, P)) = { (dom(g), F) | g in eval(D(G), P) }
>
> P as produced by the Group function
> -------
>
>
> and then:
> -------
> Definition: Evaluation of Aggregation
> eval(D(G), AggregateJoin(A)) = { (agg1, v1), ..., (aggn, vn) | vi such that ( k, vi ) in eval(D(G), Ai)
> for some k and each 1 <= i <= n }
> -------
>
> so AggregateJoin works on the value produced by eval(D(G), Ai)
>
> 	Andy

--
Steve Harris, CTO, Garlik Limited
1-3 Halford Road, Richmond, TW10 6AW, UK
+44 20 8439 8203  http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11