Re: grouping by expressions from Andy Seaborne on 2010-11-03 (public-rdf-dawg@w3.org from October to December 2010)

From: Andy Seaborne <andy.seaborne@epimorphics.com>
Date: Wed, 03 Nov 2010 13:15:43 +0000
To: Steve Harris <steve.harris@garlik.com>
CC: SPARQL Working Group <public-rdf-dawg@w3.org>
Message-ID: <4CD1607F.7050709@epimorphics.com>

>> Having a variable created in algebra generation means that the XSD expression evaluation is untouched: everything happens inside a "group" algebra operation: definition of the group keys, calculation of aggregates.
>
> This doesn't require an implicit variable, as far as I can see.
>
> Aggregate() as it's written now in the draft can handle GROUP BY (expression) without an implicit variable.

I think you mean Group(), not Aggregate().

I didn't mean that "group" - I should have chosen a different name.
Your "Aggregation" would be closer but grouping does not always involve 
aggregation.

> If you just omit all mention of a variable in the spec text then it's clearer, no?

Not for me.
The example in the WD is:

[[
And so Aggregation((?y, ?z), ex:agg, {}, G) =

{ (1) → eg:agg({(2, 3), (3, 4)}, {})), (2) → eg:agg({(5, 6)}, {}) }.
]]

Now if the query is (slight modification alert):

   SELECT (ex:agg(?y, ?z)+1 AS ?agg)
   WHERE { ?x ?y ?z }
   GROUP BY ?x.

how do we get from

{ (1) → eg:agg({(2, 3), (3, 4)}, {})), (2) → eg:agg({(5, 6)}, {}) }.

to being able to calculate ex:agg(?y, ?z)+1  ?

ex:agg(?y, ?z)+1 is an expression - it needs a solution to calculate the 
"+".

The easy way is to keep a clean separation of the group/aggregate 
process and the expression evaluation of SELECT expressions and HAVING.

Given:
{ (1) → eg:agg({(2, 3), (3, 4)}, {})), (2) → eg:agg({(5, 6)}, {}) }.

## Not sure what the {} are.

Let's take ex:agg to be MAX(?y), and evaluate it in Aggregation()

{ (1) → MAX({(2, 3), (3, 4)}), (2) → MAX({(5, 6)}) }.
=
{ (1) → 3, (2) → 5 }.

This is not a solution binding and can't be used in an expression.
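
A rough Python sketch of that point, not spec text: the function name and the dict representation of solutions are assumptions made for illustration only. It groups the three solutions from the WD example by ?x and takes MAX(?y) per group. The result maps group keys to values; no variable names the value, so an expression like "+1" has nothing to evaluate against.

```python
from collections import defaultdict

# Solutions of { ?x ?y ?z } from the WD example, as variable -> value dicts.
solutions = [
    {"x": 1, "y": 2, "z": 3},
    {"x": 1, "y": 3, "z": 4},
    {"x": 2, "y": 5, "z": 6},
]

def aggregation_max_y(rows):
    """Hypothetical helper: group by ?x, then MAX(?y) within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["x"],)].append(row)  # group key is a tuple, like (1)
    # The result maps group key -> aggregate value; it is not a solution.
    return {key: max(r["y"] for r in members) for key, members in groups.items()}

result = aggregation_max_y(solutions)
```

Here result is keyed by (1,) and (2,) rather than by a variable, which is the gap the internal variable is meant to fill.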

Let's now take:
   SELECT (MAX(?y)+1 AS ?agg)

Let's call, internally only, MAX(?y)=?N because solutions are a mapping 
from variables to values so we need a variable to associate the value of 
MAX(?y) with.

then
   (MAX(?y)+1 AS ?agg)
is translated to
   extend (?N+1 AS ?agg)

the output of the group/aggregate step is a table (multiset of bindings):

?x  MAX(?y)
1   3
2   5
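
The two steps can be sketched in Python as follows. This is only an illustration under assumptions: group_max and extend are made-up helper names, solutions are plain dicts, and "N" stands in for the internal variable. The group/aggregate step emits one solution per group that binds the internal variable, and extend then evaluates ?N+1 against each of those solutions like any other expression.

```python
from collections import defaultdict

solutions = [
    {"x": 1, "y": 2, "z": 3},
    {"x": 1, "y": 3, "z": 4},
    {"x": 2, "y": 5, "z": 6},
]

def group_max(rows, key_var, agg_var, out_var):
    """Hypothetical group step: one solution per group, binding out_var to MAX(agg_var)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_var]].append(row[agg_var])
    return [{key_var: key, out_var: max(values)} for key, values in groups.items()]

def extend(rows, new_var, expr):
    """Hypothetical extend step: add new_var, computed from each solution on its own."""
    return [{**row, new_var: expr(row)} for row in rows]

# MAX(?y) is associated with the internal variable ?N ...
grouped = group_max(solutions, "x", "y", "N")
# ... and (MAX(?y)+1 AS ?agg) becomes extend (?N+1 AS ?agg).
extended = extend(grouped, "agg", lambda row: row["N"] + 1)
```

Expression evaluation never sees ?y or the groups; it only sees solutions that happen to bind ?N.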

Can't write MAX(?y) directly into "extend (?N+1 AS ?agg)"
for several reasons:

1/ MAX(?y) isn't a function that results in a single value.

2/ ?y is out-of-scope by then

3/ there isn't the information of a group key so can't look up the group 
key to get the value - can't get to the (1)

We could call the variable ?"MAX(?y)" but making it unique across the 
whole query seems easier and there could be another, different, MAX(?y).

> That the bit you have to be careful around. If you just omit all mention of a variable in the spec text then it's clearer, no?

?N can't escape - the only ways out are via projection, so it's either 
explicit names, including AS, or the original syntax of SELECT * 
further out in the query.

But define SELECT * during the algebra translation as only finding 
variables used in the syntax, and it can't be ?N.

(ARQ uses illegal variable names so it can easily determine the class 
of variable later - convenience, not need).
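
A small Python sketch of the containment argument. The "." prefix, the helper names and the dict representation are all assumptions for illustration, not a description of what ARQ actually does. If SELECT * is expanded to the variables found in the query syntax, an internal variable is dropped at projection whatever it is called; an unparseable name just makes it cheap to recognise later.

```python
# Assumption for this sketch: "." cannot start a variable name in the query syntax.
INTERNAL_PREFIX = "."

def is_internal(name):
    """True for variables allocated during algebra generation (hypothetical scheme)."""
    return name.startswith(INTERNAL_PREFIX)

def select_star(rows, syntax_vars):
    """Project onto the variables that appear in the query syntax only."""
    return [{v: row[v] for v in syntax_vars if v in row} for row in rows]

# Solutions after group + extend, with ".0" playing the role of ?N.
rows = [
    {"x": 1, ".0": 3, "agg": 4},
    {"x": 2, ".0": 5, "agg": 6},
]

# Variables written in the query text: ?x and ?agg (via AS).
syntax_vars = ["x", "agg"]

visible = select_star(rows, syntax_vars)
leaked = [v for row in visible for v in row if is_internal(v)]
```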

Greg - how do you do it?

 Andy

Received on Wednesday, 3 November 2010 13:16:25 UTC