- From: Steve Harris <steve.harris@garlik.com>
- Date: Thu, 3 Sep 2009 11:46:21 +0100
- To: "Seaborne, Andy" <andy.seaborne@hp.com>
- Cc: "public-rdf-dawg@w3.org Group" <public-rdf-dawg@w3.org>
On 3 Sep 2009, at 11:32, Seaborne, Andy wrote:
>
>> -----Original Message-----
>> From: Steve Harris [mailto:steve.harris@garlik.com]
>> Sent: 02 September 2009 13:18
>> To: Seaborne, Andy
>> Cc: public-rdf-dawg@w3.org Group
>> Subject: Re: Semantics of aggregates
>>
>> On 1 Sep 2009, at 18:08, Seaborne, Andy wrote:
>>
>>>>> My idea of a common case is when all the rows grouped have the
>>>>> same
>>>>> type or are undefined. Is that your common case?
>>>>
>>>> No. Our apps have two cases, one where we control the data (or at
>>>> least the conversion of it) and one where we take in arbitrary data
>>>> from outside. I imagine that some people have apps with some
>>>> combination of these two, but currently we don't.
>>>
>>> Could you make that a concrete example, with the answers that might
>>> make sense to you? (test case mode = on)
>>
>> Good idea.
>>
>> Let's imagine distributed product reviews from different sources:
>>
>> <thing> rev:hasReview [ rev:rating 1.0 ] .
>> <thing> rev:hasReview [ rev:rating 4 ] .
>> <thing> rev:hasReview [ rev:rating "3" ] .
>> <thing> rev:hasReview [ rev:rating "2.5"^^xsd:double ] .
>
> Can I add one case?
>
> When I tried this myself, I noticed that these happen to give an
> order, when done lexically, that is the same as the value order.
>
> <thing> rev:hasReview [ rev:rating "03" ] .
>
> ("10" comes after 1.0 lexically so that didn't illustrate the point)
Yes, makes it a better example.
> See attached.
>
>>
>> We can get all the ratings using SPARQL/Query 1.0:
>>
>> SELECT ?rating
>> WHERE {
>> <thing> rev:hasReview [ rev:rating ?rating ] .
>> }
>>
>>
>> But, we might want to get the range of scores: [inventing MIN(),
>> MAX()
>> and MEAN()]
>>
>> SELECT MIN(?rating), MAX(?rating), MEAN(?rating)
>> WHERE {
>> <thing> rev:hasReview [ rev:rating ?rating ] .
>> }
>>
>> Here, I might hope to get
>>
>> [1] MIN(?rating) MAX(?rating) MEAN(?rating)
>> 1.0 "4" 2.625
>
> ARQ gets much the same:
>
> ---------------------
> | min | max | avg |
> =====================
> | 1.0 | 4 | 2.5e0 |
> ---------------------
>
> Because avg ignores the "3" and only considers numeric values.
> Would that be OK with you?
Yes, it seems like a sensible compromise, under the circumstances.
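To make the skipping behaviour concrete, here's a minimal Python sketch (not ARQ's actual implementation) of an AVG that only considers numeric values; `sparql_avg` is an invented name:

```python
def sparql_avg(values):
    """Average over the numeric members of a group; None if there are none.

    Non-numeric literals like the plain string "3" are skipped rather
    than raising an error, mirroring the behaviour described above.
    """
    numbers = [v for v in values if isinstance(v, (int, float))]
    if not numbers:
        return None  # empty group, or no numeric values at all
    return sum(numbers) / len(numbers)

# The four ratings from the example: 1.0, 4, "3" (plain literal), "2.5"^^xsd:double
ratings = [1.0, 4, "3", 2.5]
print(sparql_avg(ratings))  # 2.5 -- the string "3" is ignored
```

Here (1.0 + 4 + 2.5) / 3 gives the 2.5 that ARQ reports as 2.5e0.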
> With the modified data, I get:
>
> ----------------------
> | min | max | avg |
> ======================
> | "03" | 4 | 2.5e0 |
> ----------------------
>
> because ORDER BY happened to use lexical form first to get a total
> ordering of terms that can't be compared by "<". But that isn't the
> only choice - another one would have been to sort on the different
> value spaces (each unknown datatype is a separate "value space") -
> that would have put the strings at the front or the back, so a
> string would come out of MIN() or MAX() whenever one is present.
>
> Without changing ORDER BY to be more defined, we either live with
> this oddity for a design including MIN/1 or return multiple choices.
I'd prefer to live with the oddity myself. I can see the multiple
values thing being a real pain to live with.
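For illustration, a small Python sketch of the lexical-fallback choice: `lexical_min` is a hypothetical name, and Python's `str()` stands in for the RDF lexical form:

```python
def lexical_min(terms):
    """MIN imposing a total order via the lexical (string) form of each
    term, one possible way to compare terms "<" cannot handle."""
    return min(terms, key=lambda t: str(t))

# 1.0, 4, "3", "2.5"^^xsd:double plus the added "03" literal
terms = [1.0, 4, "3", 2.5, "03"]
print(lexical_min(terms))  # "03" -- lexically before "1.0"
```

Sorting by value space instead would put all strings before (or after) all numbers, which is the alternative design mentioned above.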
>> But I can see arguments for requiring an explicit "cast" to
>> xsd:decimal or something, like MEAN(?rating, xsd:decimal) or
>> MEAN(xsd:decimal(?rating)) and then getting:
>>
>> [2] MIN(?rating) MAX(?rating) MEAN(?rating)
>> 1.0 4.0 2.625
>
> Then I get much the same for MIN(xsd:decimal(....)) etc:
>
> -------------------------------------------------------
> | min | max | avg |
> =======================================================
> | 1.0 | "4"^^xsd:decimal | 2.625000000000000000000000 |
> -------------------------------------------------------
>
> It's not 4.0 because casting didn't force the canonical form, but it
> is the same value. As you can see from the AVG, maybe it should :-)
> Either way, the RDF term has been changed from xsd:integer to
> xsd:decimal but the value is the same.
>
> On the modified data:
>
> -------------------------------------------------------
> | min | max | avg |
> =======================================================
> | 1.0 | "4"^^xsd:decimal | 2.700000000000000000000000 |
> -------------------------------------------------------
>
> So casting does do something useful on MIN.
Yep, that looks like a useful, and unsurprising result.
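A rough Python sketch of the cast-then-aggregate behaviour, using `decimal.Decimal` to stand in for xsd:decimal; `cast_decimal` and `agg_after_cast` are invented names, and values that fail the cast are silently skipped:

```python
from decimal import Decimal


def cast_decimal(term):
    """xsd:decimal-style cast; returns None when the cast fails."""
    try:
        return Decimal(str(term))
    except Exception:
        return None


def agg_after_cast(terms):
    """MIN, MAX, AVG over the values that survive the decimal cast."""
    vals = [d for d in (cast_decimal(t) for t in terms) if d is not None]
    return min(vals), max(vals), sum(vals) / len(vals)


# The original ratings plus a "good" literal, which fails the cast
ratings = ["1.0", "4", "3", "2.5", "good"]
print(agg_after_cast(ratings))
# (Decimal('1.0'), Decimal('4'), Decimal('2.625'))
```

With the cast in place the plain "3" contributes to the average, giving 2.625 as in result [2] above.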
> ARQ silently ignores things that don't evaluate:
>
> <thing> rev:hasReview [ rev:rating "good" ] .
Again, I think that's what I'd expect/hope to happen.
>> [1] would probably require explicit string, date, and numerical
>> aggregate functions, eg STRMIN(), DATEMIN(), which is maybe not a
>> great idea.
>
> If we want [1] to work on the modified data, I think we need that
> sort of thing but I'm not sure whether it is better to just have a
> plain MIN/1 and live with the odd effects.
Agreed.
> An operator that differed from casting by not changing the datatype
> of a number (giving 1.0 and 4, not "4"^^xsd:decimal) and skipping
> strings rather than turning them into numbers would put it all in
> one place, without needing a variant for every aggregator.
> valueCompatible(xsd:decimal, ?x) => ?x or an error
Seems excessive; I think the [2] case is fine for this type of
use case, though maybe you have some other case in mind that makes
this more appealing?
>> Also, we might want the histogram: [inventing COUNT(), SAMPLE() and
>> FLOOR()]
>>
>
> fn:floor - not an aggregate?
Yes, not an aggregate function.
>> SELECT xsd:integer(FLOOR(SAMPLE(?rating))), COUNT(?rating)
>> WHERE {
>> <thing> rev:hasReview [ rev:rating ?rating ] .
>> } GROUP BY FLOOR(?rating)
>>
>> Where I would expect
>>
>> xsd:integer(FLOOR(SAMPLE(?rating))) COUNT(?rating)
>> 1 1
>> 3 1
>> 4 1
>> 2 1
>>
>> In the histogram case I don't see any way round the cast/floor
>> function, as I need to be explicit about what I want done, but that's
>> fine I think.
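As a sanity check, the grouping in that query can be sketched in Python (assuming the ratings have already been coerced to numbers, so the plain "3" appears here as 3.0):

```python
import math


def histogram(ratings):
    """Group numeric ratings by FLOOR(?rating) and count each group,
    mirroring the GROUP BY FLOOR(?rating) query above."""
    groups = {}
    for r in ratings:
        key = int(math.floor(r))
        groups[key] = groups.get(key, 0) + 1
    return groups


print(histogram([1.0, 4, 3.0, 2.5]))  # {1: 1, 4: 1, 3: 1, 2: 1}
```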
>>
>>>> In the first case it's as you say, either it's not present, or
>>>> it's a
>>>> known datatype, in the second case you get all kinds of values.
>>>
>>> Let's get concrete:
>>>
>>> Does option 5 [1] have initial consensus? Use the relationship as
>>> given by "ORDER BY" which includes "<".
>>
>> 5 seems sensible, depending on what's done with unbound values. The
>> obvious thing seems to be that they're not carried over into
>> aggregates, but that should be explicit as they can be sorted on.
>
> I think that an aggregate that uses variables will have to define
> its effect on missing values itself; it will have to make a
> decision about a group collection with nothing in it as well (it has
> to anyway: SELECT AGG(?x) { FILTER(false) })
>
> For all the examples so far that use a variable, skipping unbounds
> makes sense. (What if it's an expression and the expression does not
> evaluate? Skip as well?)
On the face of it skipping non-evaluating expressions seems sensible,
and in keeping with the rest of SPARQL.
> Count(*) counts rows, even if the row is empty. Count(?x) and
> count(*) can be different.
Good point. I hadn't considered the empty result case.
There's also the question of whether/how we support the
AGGREGATE(DISTINCT expr) form as well.
> SELECT count(*) {} ==> 1
> SELECT count(?x) { ?s ?p ?o OPTIONAL { ?p ?q ?x . FILTER (false) } }
> ==> 0
Seems logical.
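A tiny Python sketch of that distinction, modelling each solution row as a dict in which an unbound variable is simply absent (`count_star` and `count_var` are invented names):

```python
def count_star(rows):
    """COUNT(*): counts rows, even the empty (all-unbound) row."""
    return len(rows)


def count_var(rows, var):
    """COUNT(?var): counts only rows where ?var is bound."""
    return sum(1 for row in rows if var in row)


# SELECT count(*) {} yields a single empty solution row
rows = [{}]
print(count_star(rows))      # 1
print(count_var(rows, "x"))  # 0
```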
- Steve
--
Steve Harris
Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK
+44(0)20 8973 2465 http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10
9AD
Received on Thursday, 3 September 2009 10:47:12 UTC