DISTINCT with aggregates (was: Re: Views on the outcomes of F2F) from Lee Feigenbaum on 2009-11-13 (public-rdf-dawg@w3.org from October to December 2009)

From: Lee Feigenbaum <lee@thefigtrees.net>
Date: Thu, 12 Nov 2009 23:41:34 -0500
To: Andy Seaborne <andy.seaborne@talis.com>
CC: RDF Data Access Working Group <public-rdf-dawg@w3.org>
Message-ID: <4AFCE37E.5090601@thefigtrees.net>

Andy Seaborne wrote:
>  > **  DISTINCT in aggregate functions
>  >
>  > Consensus on allowing DISTINCT with multiple arguments to aggregate
>  > functions. DISTINCT in this case passes just the DISTINCT tuples
>  > into the aggregate function (for each group).
> 
> I'm unclear why it should be allowed in SUM or AVG.  Is there a use case?

I asked at the F2F if there were use cases for SUM(DISTINCT ...) and was 
told there were, but don't remember anyone going into the details.

Motivation was overall consistency and consistency with SQL. I also 
described Open Anzo's implementation which does this (allow DISTINCT on 
any aggregate function's arguments, including custom aggregates). What's 
the benefit of not allowing it?

> We are already handling * differently by aggregate and DISNTINCT seems 
> to only really man anything there.  Are there specific motivating use 
> cases?

DISTINCT has meaning in plenty of aggregates (e.g. GROUP_CONCAT which 
may or may not end up as a built-in aggregate - still an open issue).

> Is DISTINCT allowed in custom aggregates ?  If so, they have different 
> syntax.

The F2F consensus was that DISTINCT is indeed allowed in custom aggregates.

> I propose that DISTINCT is not allowed for custom aggregates.  An 
> aggregate can choose to do that operation as part of it's definition but 
> DISTINCT and not-DISTINCT forms are two different URI to name the 
> aggregate.

I am personally not in favor of this proposal. What's the benefit of 
preventing DISTINCT? If I needed to have two different URIs for DISTINCT 
and non-DISTINCT versions of custom aggregates, I'd end up inventing a 
new mechanism to describe aggregates to link together distinct and 
non-distinct versions via something in the service description so that I 
can use a user interface to choose an aggregate and whether it acts over 
distinct tuples or not.

But mainly, I don't understand the motivation for prohibiting DISTINCT 
in these cases, given that it has a very clear semantics (distinct'ing 
the solution sets passed into the aggregate) and use cases (such as 
group_concat).

Lee

Received on Friday, 13 November 2009 04:42:17 UTC