Re: Semantics of aggregates

On 1 Sep 2009, at 18:08, Seaborne, Andy wrote:

>>> My idea of a common case is when all the rows grouped have the same
>>> type or are undefined.  Is that your common case?
>>
>> No. Our apps have two cases, one where we control the data (or at
>> least the conversion of it) and one where take in arbitrary data from
>> outside. I imagine that some people have apps with some combination  
>> of
>> these two, but currently we don't.
>
> Could you make that a concrete example, with the answers that might  
> make sense to you? (test case mode = on)

Good idea.

Lets imagine distributed product reviews from different sources:

<thing> rev:hasReview [ rev:rating 1.0 ] .
<thing> rev:hasReview [ rev:rating 4 ] .
<thing> rev:hasReview [ rev:rating "3" ] .
<thing> rev:hasReview [ rev:rating "2.5"^^xsd:double ] .

We can get all the ratings using SPARQL/Query 1.0:

SELECT ?rating
WHERE {
   <thing> rev:hasReview [ rev:rating ?rating ] .
}


But, we might want to get the range of scores: [inventing MIN(), MAX()  
and MEAN()]

SELECT MIN(?rating), MAX(?rating), MEAN(?rating)
WHERE {
   <thing> rev:hasReview [ rev:rating ?rating ] .
}

Here, I might hope to get

[1]  MIN(?rating)  MAX(?rating)  MEAN(?rating)
      1.0           "4"           2.625

But I can see arguments for requiring an explicit "cast" to  
xsd:decimal or something, like MEAN(?rating, xsd:decimal) or  
MEAN(xsd:decimal(?rating)) and then getting:

[2] MIN(?rating)  MAX(?rating)  MEAN(?rating)
     1.0           4.0           2.625

[1] would probably require explicit string, date, and numerical  
aggregate functions, eg STRMIN(), DATEMIN(), which is maybe not a  
great idea.


Also, we might want the histogram: [inventing COUNT(), SAMPLE() and  
FLOOR()]

SELECT xsd:integer(FLOOR(SAMPLE(?rating))), COUNT(?rating)
WHERE {
   <thing> rev:hasReview [ rev:rating ?rating ] .
} GROUP BY FLOOR(?rating)

Where I would expect

xsd:integer(FLOOR(SAMPLE(?rating)))  COUNT(?rating)
1                                    1
3                                    1
4                                    1
2                                    1

In the histogram case I don't see any way round the cast/floor  
function, as I need to be explicit about what I want done, but that's  
fine I think.

>> In the first case it's as you say, either it's not present, or it's a
>> known datatype, in the second case you get all kinds of values.
>
> Let's get concrete:
>
> Does option 5 [1] have initial consensus? Use the relationship as  
> given by "ORDER BY" which includes "<".

5 seems sensible, depending one what's done with unbound values. The  
obvious thing seems to be that they're not carried over into  
aggregates, but that should be explicit as they can be sorted on.

> This gives a spec-determined answer for many cases but not all.   
> Local implementations may give a total ordering using "ORDER BY" but  
> it's not to be relied on across implementations (I could not do that  
> in ARQ because the ordering changes as you add datatypes understood  
> by "<")

Yes, that's an issue. None of my system have the ability to  
dynamically add datatypes.

> With this:
>
> SELECT MIN(?x) { .. } # No GROUP BY
> is the same as:
> SELECT ?x { .. } ORDER BY ?x LIMIT 1
>
> where ORDER BY is determined.

That seems like it has low user surprise in it's favour.

> Across a mix of type of RDF node, you may not get the type your  
> application wants but that is possible if you use MIN(some  
> expression).
>
> What cases do people see where this is inconvenient?


- Steve

-- 
Steve Harris
Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK
+44(0)20 8973 2465  http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10  
9AD

Received on Wednesday, 2 September 2009 12:18:58 UTC