Re: Prioritised list of open issues (query, my bits) from Andy Seaborne on 2010-02-09 (public-rdf-dawg@w3.org from January to March 2010)

From: Andy Seaborne <andy.seaborne@talis.com>
Date: Tue, 09 Feb 2010 13:53:55 +0000
To: Steve Harris <steve.harris@garlik.com>
CC: Lee Feigenbaum <lee@thefigtrees.net>, "public-rdf-dawg@w3.org Group" <public-rdf-dawg@w3.org>
Message-ID: <4B7168F3.3030101@talis.com>

On 09/02/2010 1:34 PM, Steve Harris wrote:
> On 9 Feb 2010, at 11:27, Andy Seaborne wrote:
>> On 09/02/2010 10:29 AM, Lee Feigenbaum wrote:
>>> Steve Harris wrote:
>>>> On 9 Feb 2010, at 09:00, Andy Seaborne wrote:
>>>>> On 08/02/2010 10:23 AM, Steve Harris wrote:
>>>>>> http://www.w3.org/2009/sparql/track/issues/35
>>>>>> Can aggregate functions take DISTINCT as an argument a la SELECT
>>>>>> COUNT(DISTINCT ?X)?
>>>>>> - Seems consensus on yes.
>>>>>
>>>>> A URI should name the function, not a collection of related
>>>>> functionality.
>>>>>
>>>>> Example:
>>>>>
>>>>> COUNT(DISTINCT ?x) vs COUNT(?x)
>>>>>
>>>>> How do you name the difference if they are not different URIs?
>>>>
>>>> In my view, DISTINCT does not change the function, it changes the
>>>> (multi)set that the function is applied to, c.f.
>>>> http://www.w3.org/2009/sparql/docs/query-1.1/rq25.xml#aggregateAlgebra
>>>>
>>>> More concretely, you form a DISTINCT multiset of the bound values of
>>>> ?x, then apply the count function to the resulting set.
>>>
>>> FWIW, this is exactly how Glitter treats the DISTINCT modifier for both
>>> built-in and custom aggregates. It modifies the set of solutions passed
>>> to the aggregate function.
>>
>> The defn in the doc applies it to the values of expressions of the
>> aggregate function (aside, so no seeing the expressions themselves,
>> only the result after evaluation).
>>
>> If we have, in one partition:
>>
>> (?x=1, ?y=2)
>> (?x=1, ?y=3)
>> (?x=2, ?y=3)
>>
>> which is a set of solutions.
>>
>> I'd expect
>> COUNT(DISTINCT ?x) ==> 2
>> COUNT(DISTINCT fn:floor((?x+1)/2)) ==> 1
>
> Yes,

Good - I wondered when Lee said it was the set of solutions made unique.
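
To make the difference concrete, here is a minimal Python sketch (an illustration only, not taken from the rq25 document; the names partition, count_distinct_values and count_distinct_solutions are made up for this example). It contrasts making the expression values unique with making the solutions unique, using the partition quoted above.

```python
import math

# The partition from the example: three solutions binding ?x and ?y.
partition = [
    {"x": 1, "y": 2},
    {"x": 1, "y": 3},
    {"x": 2, "y": 3},
]


def count_distinct_values(solutions, expr):
    """Evaluate expr on each solution, drop duplicate values, then count."""
    values = [expr(s) for s in solutions]
    return len(set(values))


def count_distinct_solutions(solutions, expr):
    """Drop duplicate solutions first, then evaluate expr and count."""
    unique = {tuple(sorted(s.items())) for s in solutions}
    return len([expr(dict(u)) for u in unique])


def by_x(s):
    return s["x"]


def by_floor(s):
    # Stand-in for fn:floor((?x+1)/2)
    return math.floor((s["x"] + 1) / 2)


print(count_distinct_values(partition, by_x))      # 2
print(count_distinct_values(partition, by_floor))  # 1
print(count_distinct_solutions(partition, by_x))   # 3
```

The three solutions are already pairwise different, so DISTINCT over solutions changes nothing and the count stays at 3. Only DISTINCT over the evaluated values gives the 2 and 1 expected above.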

>
> if M = your solution multiset above.
> M' = M(fn:floor((?x+1)/2)) { 1, 1, 1 }
> M'' = DISTINCT M' { 1 }
> result = Count(M'') 1
>
> This is how aggregates are defined in SQL, and I can't think of any
> pressing reason to depart from that.
>
>> which is applying the DISTINCT after the implicit projection (case 1)
>> and after expression evaluation (case 2).
>>
>> I thought Steve and I were mostly agreeing, except over whether one
>> can name the DISTINCT and non-DISTINCT versions with URIs.
>
> That's the essence I think. But maybe you're proposing something different?

An aggregate function takes the partition as a multiset, and also its
expressions.  It decides on DISTINCT or not - this is moving the
ExprMultiset machinery inside the aggregate.

It has a very similar effect but allows the aggregate to decide how to 
handle unbounds, expression errors etc (in SQL, it is the aggregate 
function that decides although all the standard ones do the same).

Different URIs name different aggregators, so they can have different
DISTINCT effects.
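
A rough Python sketch of that shape, under stated assumptions: the example.org URIs, the AGGREGATES table and the count function are hypothetical, and skipping unbound variables is just one policy an aggregate might choose. The aggregate receives the whole partition plus the expression, evaluates it itself, and applies DISTINCT (or not) internally.

```python
def count(partition, expr, distinct):
    """COUNT that owns expression evaluation, error handling and DISTINCT."""
    values = []
    for solution in partition:
        try:
            values.append(expr(solution))
        except (KeyError, TypeError):
            # Policy chosen by this aggregate (an assumption): skip
            # solutions where the expression is unbound or errors.
            continue
    if distinct:
        values = set(values)
    return len(values)


# Hypothetical URIs: each one names exactly one aggregator.
AGGREGATES = {
    "http://example.org/agg#count":
        lambda partition, expr: count(partition, expr, distinct=False),
    "http://example.org/agg#count-distinct":
        lambda partition, expr: count(partition, expr, distinct=True),
}

# The partition from earlier in the thread, plus one solution with ?x unbound.
partition = [
    {"x": 1, "y": 2},
    {"x": 1, "y": 3},
    {"x": 2, "y": 3},
    {"y": 4},
]


def x_expr(solution):
    return solution["x"]  # raises KeyError when ?x is unbound


print(AGGREGATES["http://example.org/agg#count"](partition, x_expr))           # 3
print(AGGREGATES["http://example.org/agg#count-distinct"](partition, x_expr))  # 2
```

Another aggregator under a different URI could treat the unbound ?x as an error for the whole group instead of skipping it, without changing anything outside the function.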

I found this the most concise description:
http://www.postgresql.org/docs/8.4/static/sql-expressions.html#SYNTAX-AGGREGATES

	Andy

>
>> And, maybe, the treatment of * - I prefer a treatment that passes
>> solutions and expressions to the aggregate so * is not different.
>
> I think the way it's expressed in SQL is quite neat. It means that
> COUNT(*) is a special case, but that's not a huge problem in my opinion.
>
> - Steve
>

Received on Tuesday, 9 February 2010 13:54:28 UTC