Re: DISTINCT with aggregates from Lee Feigenbaum on 2009-11-13 (public-rdf-dawg@w3.org from October to December 2009)

From: Lee Feigenbaum <lee@thefigtrees.net>
Date: Fri, 13 Nov 2009 09:21:49 -0500
To: Andy Seaborne <andy.seaborne@talis.com>
CC: RDF Data Access Working Group <public-rdf-dawg@w3.org>
Message-ID: <4AFD6B7D.8040104@thefigtrees.net>
Andy Seaborne wrote:
> 
> 
> On 13/11/2009 04:41, Lee Feigenbaum wrote:
>> Andy Seaborne wrote:
>>> > ** DISTINCT in aggregate functions
>>> >
>>> > Consensus on allowing DISTINCT with multiple arguments to aggregate
>>> > functions. DISTINCT in this case passes just the DISTINCT tuples
>>> > into the aggregate function (for each group).
>>>
>>> I'm unclear why it should be allowed in SUM or AVG. Is there a use case?
>>
>> I asked at the F2F if there were use cases for SUM(DISTINCT ...) and was
>> told there were, but don't remember anyone going into the details.
>>
>> Motivation was overall consistency and consistency with SQL. I also
>> described Open Anzo's implementation which does this (allow DISTINCT on
>> any aggregate function's arguments, including custom aggregates). What's
>> the benefit of not allowing it?
>>
>>> We are already handling * differently by aggregate and DISTINCT seems
>>> to only really mean anything there. Are there specific motivating use
>>> cases?
>>
>> DISTINCT has meaning in plenty of aggregates (e.g. GROUP_CONCAT which
>> may or may not end up as a built-in aggregate - still an open issue).
>>
>>> Is DISTINCT allowed in custom aggregates ? If so, they have different
>>> syntax.
>>
>> The F2F consensus was that DISTINCT is indeed allowed in custom 
>> aggregates.
>>
>>> I propose that DISTINCT is not allowed for custom aggregates. An
>>> aggregate can choose to do that operation as part of its definition
>>> but DISTINCT and not-DISTINCT forms are two different URIs to name the
>>> aggregate.
>>
>> I am personally not in favor of this proposal. What's the benefit of
>> preventing DISTINCT? If I needed to have two different URIs for DISTINCT
>> and non-DISTINCT versions of custom aggregates, I'd end up inventing a
>> new mechanism to describe aggregates to link together distinct and
>> non-distinct versions via something in the service description so that I
>> can use a user interface to choose an aggregate and whether it acts over
>> distinct tuples or not.
>>
>> But mainly, I don't understand the motivation for prohibiting DISTINCT
>> in these cases, given that it has a very clear semantics (distinct'ing
>> the solution sets passed into the aggregate) and use cases (such as
>> group_concat).
> 
> It's a practical consideration of syntax: maybe someone has a proposal 
> here but it seems to me that if custom aggregate syntax is function 
> syntax, there is nowhere to put the word "DISTINCT". But if function 
> syntax changes to allow DISTINCT, it allows it in all places.
> 
> I can't see how a new custom aggregate production in the grammar based 
> on lookahead for DISTINCT in the expression clause would work because it 
> needs to be valid for absence of DISTINCT as well and that is 
> syntactically ambiguous as it is a custom function call.
> 
> Custom syntax makes it easier, but not nice looking IMHO:
> 
>     AGG(URI, [DISTINCT,] exprList)
> 
> and we also have the related matter of whether we have to have two sets 
> of expression rules, one which allows aggregates and one that doesn't.

Thanks, Andy. I hadn't understood previously that this is a 
grammar/syntax issue. In Open Anzo, I check this in code while parsing 
and throw an exception if DISTINCT is used with a non-aggregate function.
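
To make that kind of parse-time check concrete, here is a minimal sketch in
Python. It is not Open Anzo's actual code: the names KNOWN_AGGREGATES,
FunctionCall, QueryParseError and check_distinct, and the example.org URIs,
are all hypothetical, and it assumes the parser already has some registry
telling it which function names or URIs denote aggregates.

```python
from dataclasses import dataclass, field

# Assumption: the parser can consult a registry of aggregate names/URIs
# (built-in and custom). The entries here are illustrative only.
KNOWN_AGGREGATES = {
    "COUNT",
    "SUM",
    "AVG",
    "GROUP_CONCAT",
    "http://example.org/agg#median",  # hypothetical custom aggregate
}


class QueryParseError(Exception):
    """A query the grammar accepts but that is nevertheless invalid."""


@dataclass
class FunctionCall:
    name: str                 # keyword or URI naming the function
    distinct: bool = False    # True if DISTINCT preceded the arguments
    args: list = field(default_factory=list)


def check_distinct(call: FunctionCall) -> FunctionCall:
    """Reject DISTINCT on anything that is not a known aggregate."""
    if call.distinct and call.name not in KNOWN_AGGREGATES:
        raise QueryParseError(
            f"DISTINCT is not allowed on non-aggregate function {call.name}"
        )
    return call
```

With a check like this the grammar itself can stay permissive (function
syntax with an optional DISTINCT everywhere), at the cost of accepting some
queries syntactically that are then rejected in code.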

I guess this is sort of related to the question of whether we want a 
keyword to introduce custom aggregate functions: how important is it to 
minimize the number of invalid queries that are syntactically valid?

I believe that in SPARQL 1.0 the only such case is a query that reuses a 
bnode label across BGPs?

lee


>     Andy
> 
>>
>> Lee
>>
>
Received on Friday, 13 November 2009 14:22:33 UTC