Re: grouping by expressions

On 03/11/10 10:59, Steve Harris wrote:
> On 2010-11-03, at 10:05, Andy Seaborne wrote:
>> On 03/11/10 07:11, Steve Harris wrote:
>>> On 3 Nov 2010, at 02:42, Gregory Williams <greg@evilfunhouse.com> wrote:
>>>
>>>> On Nov 2, 2010, at 5:06 PM, Lee Feigenbaum wrote:
>>>>
>>>>> I believe there are likely three options:
>>>>>
>>>>> 1) To project grouping expressions, use BIND to alias the expression to a variable and then GROUP BY and project that variable (as above).
>>>>>
>>>>> 2) Include an AS aliasing mechanism in GROUP BY, allow that alias to be projected in the SELECT clause
>>>>>
>>>>> 3) Allow SELECT list aliases to be used in the GROUP BY expression
>>>>>
>>>>> Can people please indicate on the mailing list which direction they'd like us to go on this, and we will then wrap this up on next Tuesday's telecon?
>>>>
>>>> 3 seems backwards to me -- not really sure how it would work. I currently implement 2 and am happy with it, but 1 would seem to be reasonable also.
>>>
>>> Agreed that 3 seems odd.
>>>
>>> Preference for 1 as we're going to have that mechanism anyway. Allowing AS in GROUP BY as well seems excessive, and will further complicate the algebra in that area.
>>>
>>> - Steve
>>
>> My preference is for 2.
>>
>> It reduces the query author burden and is consistent with the style of explicit naming of used expressions we have in SELECT expressions.
>>
>> This isn't about the algebra - it would be handled during translation from syntax to algebra.
>>
>> GROUP BY (expr AS ?var)
>> ==>
>>   group (?var) .. aggregation pairs
>>     extend (expr ?var)
>>        ....
>>
>> ARQ implements (2).  For SPARQL 1.1, which is the default, it enforces naming with AS; optionally, it will generate variables if needed (extended syntax, no AS).
>
> I don't think we should /require/ AS if we add this syntax, there are situations where you want to group by an expression, but don't need to assign it to a variable, e.g.:
>
> SELECT (AVG(?time) AS ?centre) (COUNT(*) AS ?magnitude)
> WHERE {
>     ?x a <Impulse> ;
>        <timestamp> ?time .
> }
> GROUP BY round(?time * 1000)
>
> Would seem a bit strange to have to write GROUP BY (round(?time * 1000) AS ?notneeded).

Fine by me, and it seems to make other things work out naturally. If we have:

   HAVING (COUNT(*) > 0)

then that's much cleaner to handle with an implicit variable for the
COUNT(*), as it can be used multiple times in the same aggregation step.
While it's possible to define it so that the aggregation happens once per
occurrence, expression evaluation would then need to be updated to know
about aggregation functions.

Having a variable created in algebra generation means that the XSD 
expression evaluation is untouched: everything happens inside a "group" 
algebra operation: definition of the group keys, calculation of aggregates.

So for:

   SELECT (COUNT(*) AS ?c) (2*COUNT(*) AS ?c2) (1/COUNT(*) AS ?d)
   {...}
   HAVING (COUNT(*) > 0)

Allocate a new variable and assign the aggregation calculation once to 
that one new variable.  Rewrite all the expressions to use that var.

Such a variable can't escape and be visible to the results without 
having been given a legal name via AS.
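
As a rough illustration of the rewrite step above (not ARQ's actual
code), here is a minimal Python sketch that works on expression strings
rather than a parsed query. The function name allocate_aggregates, the
regular expression, and the "?.aggN" naming scheme are all assumptions
made for this example; the only point of the "." is that such a name
cannot be written as a variable in SPARQL syntax, so it cannot be
projected without an AS.

```python
import re
from itertools import count

# Matches simple aggregate calls such as COUNT(*) or AVG(?time).
# Deliberately naive: no DISTINCT, no nested expressions.
AGG = re.compile(r"\b(COUNT|SUM|AVG|MIN|MAX)\s*\(\s*(\*|\?\w+)\s*\)",
                 re.IGNORECASE)


def allocate_aggregates(expressions):
    """Give each distinct aggregate one internal variable and rewrite
    every expression to use that variable instead of the aggregate."""
    allocated = {}      # normalised aggregate text -> internal variable
    counter = count()

    def substitute(match):
        key = f"{match.group(1).upper()}({match.group(2)})"
        if key not in allocated:
            # Hypothetical naming scheme: "." is not legal in a SPARQL
            # variable name, so a query cannot refer to this variable.
            allocated[key] = f"?.agg{next(counter)}"
        return allocated[key]

    rewritten = [AGG.sub(substitute, e) for e in expressions]
    return allocated, rewritten


# The three SELECT expressions and the HAVING condition from the example.
exprs = ["COUNT(*)", "2*COUNT(*)", "1/COUNT(*)", "COUNT(*) > 0"]
allocated, rewritten = allocate_aggregates(exprs)
```

Here "allocated" holds a single entry for COUNT(*), so the group
operation computes it once, and "rewritten" contains only ordinary
expressions over that variable, which the existing expression evaluator
can handle unchanged.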

	Andy

Received on Wednesday, 3 November 2010 11:24:54 UTC