Re: DISTINCT with aggregates from Andy Seaborne on 2009-11-13 (public-rdf-dawg@w3.org from October to December 2009)

From: Andy Seaborne <andy.seaborne@talis.com>
Date: Fri, 13 Nov 2009 11:23:20 +0000
To: Lee Feigenbaum <lee@thefigtrees.net>
CC: RDF Data Access Working Group <public-rdf-dawg@w3.org>
Message-ID: <4AFD41A8.7020402@talis.com>

On 13/11/2009 04:41, Lee Feigenbaum wrote:
> Andy Seaborne wrote:
>> > ** DISTINCT in aggregate functions
>> >
>> > Consensus on allowing DISTINCT with multiple arguments to aggregate
>> > functions. DISTINCT in this case passes just the DISTINCT tuples
>> > into the aggregate function (for each group).
>>
>> I'm unclear why it should be allowed in SUM or AVG. Is there a use case?
>
> I asked at the F2F if there were use cases for SUM(DISTINCT ...) and was
> told there were, but don't remember anyone going into the details.
>
> Motivation was overall consistency and consistency with SQL. I also
> described Open Anzo's implementation which does this (allow DISTINCT on
> any aggregate function's arguments, including custom aggregates). What's
> the benefit of not allowing it?
>
>> We are already handling * differently by aggregate and DISNTINCT seems
>> to only really man anything there. Are there specific motivating use
>> cases?
>
> DISTINCT has meaning in plenty of aggregates (e.g. GROUP_CONCAT which
> may or may not end up as a built-in aggregate - still an open issue).
>
>> Is DISTINCT allowed in custom aggregates ? If so, they have different
>> syntax.
>
> The F2F consensus was that DISTINCT is indeed allowed in custom aggregates.
>
>> I propose that DISTINCT is not allowed for custom aggregates. An
>> aggregate can choose to do that operation as part of it's definition
>> but DISTINCT and not-DISTINCT forms are two different URI to name the
>> aggregate.
>
> I am personally not in favor of this proposal. What's the benefit of
> preventing DISTINCT?
 >
> If I needed to have two different URIs for DISTINCT
> and non-DISTINCT versions of custom aggregates, I'd end up inventing a
> new mechanism to describe aggregates to link together distinct and
> non-distinct versions via something in the service description so that I
> can use a user interface to choose an aggregate and whether it acts over
> distinct tuples or not.
>
> But mainly, I don't understand the motivation for prohibiting DISTINCT
> in these cases, given that it has a very clear semantics (distinct'ing
> the solution sets passed into the aggregate) and use cases (such as
> group_concat).

It's a practical consideration of syntax: maybe someone has a proposal 
here but it seems to me that if custom aggregate syntax is function 
syntax, there is nowhere to put the word "DISTINCT". But if function 
syntax changes to allow DISTINCT, it allows it in all places.

I can't see how a new custom aggregate production in the grammar based 
on lookahead for DISTINCT in the expression clause would work because it 
needs to be valid for absence of DISTINCT as well and that is 
syntactically ambiguous as it is a custom function call.

Custom syntax makes it easier, but not nice looking IMHO:

     AGG(URI, [DISTINCT,] exprList)

and we also have the related matter of whether we have to have two sets 
of expression rules, one which allows aggregates and one that doesn't.

 Andy

>
> Lee
>
> ______________________________________________________________________
> This email has been scanned by the MessageLabs Email Security System.
> For more information please visit http://www.messagelabs.com/email
> ______________________________________________________________________

Received on Friday, 13 November 2009 11:23:30 UTC