Re: Semantics of aggregates from Steve Harris on 2009-09-03 (public-rdf-dawg@w3.org from July to September 2009)

From: Steve Harris <steve.harris@garlik.com>
Date: Thu, 3 Sep 2009 11:46:21 +0100
To: "Seaborne, Andy" <andy.seaborne@hp.com>
Cc: "public-rdf-dawg@w3.org Group" <public-rdf-dawg@w3.org>
Message-Id: <670889D2-F54C-4C3A-A6B7-19E427C89926@garlik.com>
On 3 Sep 2009, at 11:32, Seaborne, Andy wrote:
>
>> -----Original Message-----
>> From: Steve Harris [mailto:steve.harris@garlik.com]
>> Sent: 02 September 2009 13:18
>> To: Seaborne, Andy
>> Cc: public-rdf-dawg@w3.org Group
>> Subject: Re: Semantics of aggregates
>>
>> On 1 Sep 2009, at 18:08, Seaborne, Andy wrote:
>>
>>>>> My idea of a common case is when all the rows grouped have the  
>>>>> same
>>>>> type or are undefined.  Is that your common case?
>>>>
>>>> No. Our apps have two cases, one where we control the data (or at
>>>> least the conversion of it) and one where take in arbitrary data  
>>>> from
>>>> outside. I imagine that some people have apps with some combination
>>>> of
>>>> these two, but currently we don't.
>>>
>>> Could you make that a concrete example, with the answers that might
>>> make sense to you? (test case mode = on)
>>
>> Good idea.
>>
>> Lets imagine distributed product reviews from different sources:
>>
>> <thing> rev:hasReview [ rev:rating 1.0 ] .
>> <thing> rev:hasReview [ rev:rating 4 ] .
>> <thing> rev:hasReview [ rev:rating "3" ] .
>> <thing> rev:hasReview [ rev:rating "2.5"^^xsd:double ] .
>
> Can I add one case?
>
> When I tried this myself, I noticed that these happen to give a  
> order when done lexically that the same as the value.
>
> <thing> rev:hasReview [ rev:rating "03" ] .
>
> ("10" comes after 1.0 lexically so that didn't illustrate the point)

Yes, makes it a better example.

> See attached.
>
>>
>> We can get all the ratings using SPARQL/Query 1.0:
>>
>> SELECT ?rating
>> WHERE {
>>   <thing> rev:hasReview [ rev:rating ?rating ] .
>> }
>>
>>
>> But, we might want to get the range of scores: [inventing MIN(),  
>> MAX()
>> and MEAN()]
>>
>> SELECT MIN(?rating), MAX(?rating), MEAN(?rating)
>> WHERE {
>>   <thing> rev:hasReview [ rev:rating ?rating ] .
>> }
>>
>> Here, I might hope to get
>>
>> [1]  MIN(?rating)  MAX(?rating)  MEAN(?rating)
>>      1.0           "4"           2.625
>
> ARQ gets much the same:
>
> ---------------------
> | min | max | avg   |
> =====================
> | 1.0 | 4   | 2.5e0 |
> ---------------------
>
> Because avg ignores the "3" and only considers numeric values.   
> Would that be OK to you?

Yes, it seems like a sensible compromise, under the circumstances.

> With the modified data, I get:
>
> ----------------------
> | min  | max | avg   |
> ======================
> | "03" | 4   | 2.5e0 |
> ----------------------
>
> because ORDER BY happened to uses lexical form first to get a total  
> ordering of terms that can't be compared by "<". But that isn't the  
> only choice - another one would have to been to sort on the  
> different value spaces (and each unknown datatype is separate "value  
> space") - that would have put the string at front or back so it  
> would come out in MIN() or MAX() for any string.
>
> Without changing ORDER BY to be more defined, we either live with  
> this oddity for a design including MIN/1 or return multiple choices.

I'd prefer to live with the oddity myself. I can see the multiple  
values thing being a real pain to live with.

>> But I can see arguments for requiring an explicit "cast" to
>> xsd:decimal or something, like MEAN(?rating, xsd:decimal) or
>> MEAN(xsd:decimal(?rating)) and then getting:
>>
>> [2] MIN(?rating)  MAX(?rating)  MEAN(?rating)
>>     1.0           4.0           2.625
>
> Then I get much the same for MIN(xsd:decimal(....)) etc:
>
> -------------------------------------------------------
> | min | max              | avg                        |
> =======================================================
> | 1.0 | "4"^^xsd:decimal | 2.625000000000000000000000 |
> -------------------------------------------------------
>
> It’s not 4.0 because casting didn't force to canonical form but it  
> is the same value.  As you can see from the AVG , maybe it  
> should :-)  Either way, the RDF term has been changed from  
> xsd:integer to xsd:decimal but the value is the same.
>
> On the modified data:
>
> -------------------------------------------------------
> | min | max              | avg                        |
> =======================================================
> | 1.0 | "4"^^xsd:decimal | 2.700000000000000000000000 |
> -------------------------------------------------------
>
> So casting does do something useful on MIN.

Yep, that looks like a useful, and unsurprising result.

> ARQ ignores silently things that don't evaluates:
>
> <thing> rev:hasReview [ rev:rating "good" ] .

Again, I think that's what I'd expect/hope to happen.

>> [1] would probably require explicit string, date, and numerical
>> aggregate functions, eg STRMIN(), DATEMIN(), which is maybe not a
>> great idea.
>
> If we want [1] to work on the modified data, I think we need that  
> sort of thing but I'm not sure whether it is better to just have a  
> plain MIN/1 and live with the odd effects.

Agreed.

>  Some operator that differed from casting by not changing the  
> datatype of a number to give 1.0 and 4 (not 4^decimal) and skip  
> strings rather than turn them into numbers would put it all in one  
> place and not have a variant for every aggregator.

> valueCompatible(xsd:decimal, ?x) => ?x or an error

Seems excessive, I think the [2] case is fine for this type of  
usecase, though maybe you have some other case in mind that makes this  
more appealing?

>> Also, we might want the histogram: [inventing COUNT(), SAMPLE() and
>> FLOOR()]
>>
>
> fn:floor - not an aggregate?

Yes, not an aggregate function.

>> SELECT xsd:integer(FLOOR(SAMPLE(?rating))), COUNT(?rating)
>> WHERE {
>>   <thing> rev:hasReview [ rev:rating ?rating ] .
>> } GROUP BY FLOOR(?rating)
>>
>> Where I would expect
>>
>> xsd:integer(FLOOR(SAMPLE(?rating)))  COUNT(?rating)
>> 1                                    1
>> 3                                    1
>> 4                                    1
>> 2                                    1
>>
>> In the histogram case I don't see any way round the cast/floor
>> function, as I need to be explicit about what I want done, but that's
>> fine I think.
>>
>>>> In the first case it's as you say, either it's not present, or  
>>>> it's a
>>>> known datatype, in the second case you get all kinds of values.
>>>
>>> Let's get concrete:
>>>
>>> Does option 5 [1] have initial consensus? Use the relationship as
>>> given by "ORDER BY" which includes "<".
>>
>> 5 seems sensible, depending one what's done with unbound values. The
>> obvious thing seems to be that they're not carried over into
>> aggregates, but that should be explicit as they can be sorted on.
>
> I think that an aggregate that uses variables will have to define  
> it's effect on missing values itself; it will have to have a  
> decision about a group collection with nothing in it as well (it has  
> to anyway:  SELECT AGG(?x) { FILTER(false) })
>
> For all the examples so far, that we have that use a variable,  
> skipping unbounds make sense.  (What if it's an expression and the  
> expression does not evaluate? Skip as well?)

On the face of it skipping non-evaluating expressions seems sensible,  
and in keeping with the rest of SPARQL.

> Count(*) counts rows, even if the row is empty.  Count(?x) and  
> count(*) can be different.

Good point. I hadn't considered the empty result case.

There's also the question of whether/how we support the  
AGGREGATE(DISTINCT expr) form as well.

> SELECT count(*) {} ==> 1
> SELECT count(?x) { ?s ?p ?o OPTIONAL { ?p ?q ?x . FILTER (false) } }  
> ==> 0

Seems logical.

- Steve

-- 
Steve Harris
Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK
+44(0)20 8973 2465  http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10  
9AD
Received on Thursday, 3 September 2009 10:47:12 UTC