Re: another aggregates test case... from Steve Harris on 2010-06-09 (public-rdf-dawg@w3.org from April to June 2010)

From: Steve Harris <steve.harris@garlik.com>
Date: Wed, 9 Jun 2010 10:08:26 +0100
To: Andy Seaborne <andy.seaborne@talis.com>
Cc: Lee Feigenbaum <lee@thefigtrees.net>, Axel Polleres <axel.polleres@deri.org>, SPARQL Working Group <public-rdf-dawg@w3.org>
Message-Id: <E340D485-DF93-4657-AFCB-6C059F3B1B97@garlik.com>
On 2010-06-08, at 17:59, Andy Seaborne wrote:
> On 08/06/2010 5:05 PM, Steve Harris wrote:
>> On 2010-06-08, at 16:47, Andy Seaborne wrote:
>>> 
>>> On 08/06/2010 3:12 PM, Lee Feigenbaum wrote:
>>>> On 6/8/2010 10:04 AM, Andy Seaborne wrote:
>>>>> I don't see why it needs to be an error - with no aggregation GROUP BY
>>>>> can be considered to be a partial sort. Cardinality same as without
>>>>> GROUP BY. This also happens to be a requirement in some apps - results
>>>>> clustered by key but the same number of rows as without grouping.
>>>>> Sorting can make it so, but sorting is potentially more expensive.
>>>> 
>>>> This sounds like a pretty different model of aggregation than we have
>>>> now. (Actually sounds similar to the model that was proposed on the
>>>> comments list a few months ago.) If we went this way, why not do this
>>>> all the time, and just repeat the values for the aggregate calculations?
>>>> 
>>>> I prefer to keep the existing aggregate model.
>>>> 
>>>> Lee
>>> 
>>> I'm not happy with the error case when GROUP BY is used and no aggregate is explicitly mentioned.
>> 
>> Well, the rule is something like you can only project expressions if they're exactly a variable, and match the GROUP BY expression. Otherwise it has to be an aggregate expression.
> 
> In Lee's example:
> 
> SELECT ?v1 ?v2 ?v3
> { ... }
> GROUP BY ?v1 ?v2 ?v3
> 
> is exactly variables, and they match the GROUP BY expressions, don't they?

Yes, true, it was a long day!

> [[ rq25.xml#aggregateExample
> In aggregate queries and sub-queries only expressions which have been
> used as GROUP BY expressions, or aggregated expressions (i.e.
> expressions where all variables appear inside an aggregate) can be
> projected. In order to project arbitrary expressions the SAMPLE
> aggregate may be used.
> ]]
> 
> The example was
> 
> GROUP BY ?v1 ?v2 ?v3
> 
> which can expose ?v1 ?v2 ?v3 can't it?

Yes, or any expression using a subset of those.
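
A minimal sketch of that reading, assuming each projected expression is described only by the set of variables it uses outside any aggregate. The helper name `projectable` is hypothetical and is not taken from rq25:

```python
def projectable(free_vars, group_vars):
    """Return True if a projected expression is allowed under this reading.

    free_vars:  variables the expression uses outside any aggregate
                (empty for a pure aggregate such as Count(*)).
    group_vars: variables listed in GROUP BY.
    """
    return set(free_vars) <= set(group_vars)


group_by = ["v1", "v2", "v3"]
print(projectable(["v1"], group_by))        # ?v1 alone
print(projectable(["v1", "v3"], group_by))  # an expression over ?v1 and ?v3
print(projectable([], group_by))            # Count(*): no free variables
print(projectable(["v4"], group_by))        # ?v4 is not a grouping key
```

The first three calls return True and the last returns False. The sketch does not cover GROUP BY (?x+?y), where the key is an expression rather than a variable.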

> The text seems to allow even this:
> 
> SELECT (?x+?y AS ?Z)
> { ... }
> GROUP BY (?x+?y)
> 
> because it only mentions "expressions" which seems quite generous.

It was meant to refer only to the expressions in the projection, so that needs some rewording.

> What about "SELECT (?y+?x AS ?Z)"?
> 
> I'd like to know if I'm reading the text correctly but I'm happy with the current text on projecting expressions from GROUP BY - I can see why restricting to variables may be preferred.

Otherwise we end up with an implicit SAMPLE() which the group was not keen on.

> What's wrong with these queries:
> 
> # Group by ?s and project ?s with count.
> SELECT ?s (Count(*) AS ?C)
> {
>   ?s ?p ?p
> } GROUP BY ?s

Nothing, ?C is produced with an aggregate, so it's legitimate.

> # Group by ?s and ?p but only project ?s and the group(key ?s ?p) count
> SELECT ?s (Count(*) AS ?C)
> {
>   ?s ?p ?p
> } GROUP BY ?s ?p

Again, that's legit.

> # Group by ?s and ?p and project ?s and ?p for each group
> # as well as the count
> SELECT ?s ?p (Count(*) AS ?C)
> {
>   ?s ?p ?p
> } GROUP BY ?s ?p

Ditto.

> # Group by ?s and ?p and project ?s and ?p for each group
> # as well as the count
> # Not so much use but seems legal by the text and is well-definable.
> SELECT ?s ?p (Count(*) AS ?C)
> {
>   ?s ?p ?p
> } GROUP BY ?s ?p

There are uses for that, and it's legit.

> which leads me to a fairly natural interpretation of
> 
> SELECT ?s ?p
> {
>   ?s ?p ?p
> } GROUP BY ?s ?p
> 
> as "null aggregation"

I don't understand the term "null aggregation".
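
For reference, the thread has raised two behaviours for GROUP BY with no aggregate: one row per distinct key (the existing aggregate model), and the same rows merely clustered by key (the "partial sort" reading, with cardinality unchanged). Which of these "null aggregation" refers to is not stated. A rough sketch of the difference, assuming solutions are plain dicts and using hypothetical function names:

```python
def one_row_per_group(rows, keys):
    """Existing-model reading: emit each distinct key combination once."""
    seen = {}
    for row in rows:
        key = tuple(row[k] for k in keys)
        seen.setdefault(key, dict(zip(keys, key)))
    return list(seen.values())


def cluster_by_key(rows, keys):
    """'Partial sort' reading: keep every row, but make equal keys adjacent."""
    groups = {}
    for row in rows:
        groups.setdefault(tuple(row[k] for k in keys), []).append(row)
    return [row for group in groups.values() for row in group]


rows = [
    {"s": "a", "p": "p1"},
    {"s": "b", "p": "p2"},
    {"s": "a", "p": "p1"},
]
print(len(one_row_per_group(rows, ["s", "p"])))  # 2 rows: one per group
print(len(cluster_by_key(rows, ["s", "p"])))     # 3 rows: cardinality unchanged
```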

>>> Seems useful in developing queries and makes aggregation reasonably orthogonal to grouping.
>>> 
>>> SELECT * means all the keys (i.e. variables in scope after grouping)
>> 
>> That seems fairly rational/sensible, but a significant departure from the meaning of * in non-aggregated queries.
> 
> For me, the natural meaning of "*" in "SELECT *" is all visible variables.  That covers SPARQL 1.0, subqueries and grouping. It is also the same algorithm for a syntax check of scoping GROUP BY and SELECT as above but maybe I don't understand that properly.

Sure, I think it's rational, it's just that * is defined differently in SPARQL 1.0: "SELECT * is an abbreviation that selects all of the variables in a query". 

We can define it in a way where it has the same behaviour as it did in 1.0 though, I'm sure.
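
One way such a definition could look, as a sketch only: it assumes that after grouping just the keys remain in scope, which is the reading proposed in this thread rather than agreed text, and `star_variables` is a hypothetical name.

```python
def star_variables(pattern_vars, group_keys=None):
    """Expand SELECT * to a list of variable names.

    pattern_vars: variables in the query pattern, in order of appearance.
    group_keys:   GROUP BY variables, or None if the query has no GROUP BY.
    """
    if group_keys is None:
        # No grouping: every pattern variable is visible, as in SPARQL 1.0.
        return list(pattern_vars)
    # Grouping: only the keys remain in scope (assumed reading).
    return list(group_keys)


print(star_variables(["s", "p", "o"]))              # ungrouped query
print(star_variables(["s", "p", "o"], ["s", "p"]))  # GROUP BY ?s ?p
```

The ungrouped call yields s, p and o; the grouped call yields only s and p, so queries without GROUP BY behave as they did in 1.0.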

- Steve

-- 
Steve Harris, Garlik Limited
1-3 Halford Road, Richmond, TW10 6AW, UK
+44 20 8439 8203  http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
Received on Wednesday, 9 June 2010 09:08:59 UTC