- From: Steve Harris <steve.harris@garlik.com>
- Date: Thu, 3 Sep 2009 11:46:21 +0100
- To: "Seaborne, Andy" <andy.seaborne@hp.com>
- Cc: "public-rdf-dawg@w3.org Group" <public-rdf-dawg@w3.org>
On 3 Sep 2009, at 11:32, Seaborne, Andy wrote:

>> -----Original Message-----
>> From: Steve Harris [mailto:steve.harris@garlik.com]
>> Sent: 02 September 2009 13:18
>> To: Seaborne, Andy
>> Cc: public-rdf-dawg@w3.org Group
>> Subject: Re: Semantics of aggregates
>>
>> On 1 Sep 2009, at 18:08, Seaborne, Andy wrote:
>>
>>>>> My idea of a common case is when all the rows grouped have the
>>>>> same type or are undefined. Is that your common case?
>>>>
>>>> No. Our apps have two cases: one where we control the data (or at
>>>> least the conversion of it) and one where we take in arbitrary
>>>> data from outside. I imagine that some people have apps with some
>>>> combination of the two, but currently we don't.
>>>
>>> Could you make that a concrete example, with the answers that might
>>> make sense to you? (test case mode = on)
>>
>> Good idea.
>>
>> Let's imagine distributed product reviews from different sources:
>>
>> <thing> rev:hasReview [ rev:rating 1.0 ] .
>> <thing> rev:hasReview [ rev:rating 4 ] .
>> <thing> rev:hasReview [ rev:rating "3" ] .
>> <thing> rev:hasReview [ rev:rating "2.5"^^xsd:double ] .
>
> Can I add one case?
>
> When I tried this myself, I noticed that these happen to give an
> order, when sorted lexically, that is the same as their value order.
>
> <thing> rev:hasReview [ rev:rating "03" ] .
>
> ("10" comes after 1.0 lexically, so that didn't illustrate the point)

Yes, that makes it a better example.

> See attached.
>
>> We can get all the ratings using SPARQL/Query 1.0:
>>
>> SELECT ?rating
>> WHERE {
>>   <thing> rev:hasReview [ rev:rating ?rating ] .
>> }
>>
>> But we might want to get the range of scores: [inventing MIN(),
>> MAX() and MEAN()]
>>
>> SELECT MIN(?rating), MAX(?rating), MEAN(?rating)
>> WHERE {
>>   <thing> rev:hasReview [ rev:rating ?rating ] .
>> }
>>
>> Here, I might hope to get:
>>
>> [1] MIN(?rating)  MAX(?rating)  MEAN(?rating)
>>     1.0           "4"           2.625
>
> ARQ gets much the same:
>
> ---------------------
> | min | max | avg   |
> =====================
> | 1.0 | 4   | 2.5e0 |
> ---------------------
>
> because avg ignores the "3" and only considers numeric values.
> Would that be OK by you?

Yes, it seems like a sensible compromise, under the circumstances.

> With the modified data, I get:
>
> ----------------------
> | min  | max | avg   |
> ======================
> | "03" | 4   | 2.5e0 |
> ----------------------
>
> because ORDER BY happens to use the lexical form first to get a
> total ordering of terms that can't be compared by "<". But that
> isn't the only choice - another would have been to sort on the
> different value spaces (each unknown datatype is a separate "value
> space") - that would have put the string at the front or the back,
> so it would come out of MIN() or MAX() for any string.
>
> Without changing ORDER BY to be more defined, we either live with
> this oddity for a design including MIN/1 or return multiple choices.

I'd prefer to live with the oddity myself. I can see the multiple
values thing being a real pain to live with.

>> But I can see arguments for requiring an explicit "cast" to
>> xsd:decimal or something, like MEAN(?rating, xsd:decimal) or
>> MEAN(xsd:decimal(?rating)), and then getting:
>>
>> [2] MIN(?rating)  MAX(?rating)  MEAN(?rating)
>>     1.0           4.0           2.625
>
> Then I get much the same for MIN(xsd:decimal(....)) etc.:
>
> -------------------------------------------------------
> | min | max              | avg                        |
> =======================================================
> | 1.0 | "4"^^xsd:decimal | 2.625000000000000000000000 |
> -------------------------------------------------------
>
> It's not 4.0 because casting didn't force the canonical form, but it
> is the same value.
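To make the result tables above concrete, here is a small Python sketch. It is my own toy model, not ARQ's implementation: terms are (lexical form, numeric value) pairs, AVG considers only numeric values, and MIN/MAX fall back to comparing lexical forms to get a total order over terms that "<" cannot compare. The xsd:decimal cast is approximated with Python's Decimal.

```python
from decimal import Decimal, InvalidOperation

# Toy model (a sketch, not ARQ's code) of the behaviours discussed
# above. Plain literals like "3" and "03" carry no numeric value.
terms = [
    ("1.0", Decimal("1.0")),   # xsd:decimal
    ("4",   Decimal("4")),     # xsd:integer
    ("3",   None),             # plain literal
    ("2.5", Decimal("2.5")),   # "2.5"^^xsd:double (modelled as decimal)
    ("03",  None),             # plain literal - the extra case
]

# Plain MIN/MAX/AVG: AVG skips non-numeric terms; MIN/MAX use the
# lexical-form fallback, which is how "03" beats 1.0.
nums = [v for _, v in terms if v is not None]
avg = sum(nums) / len(nums)         # (1.0 + 4 + 2.5) / 3 = 2.5
lo = min(t[0] for t in terms)       # "03" sorts before "1.0"
hi = max(t[0] for t in terms)       # "4"

# With an explicit xsd:decimal(?rating) cast, "3" and "03" become
# numbers; terms that fail to cast would raise an error and be skipped.
def cast_decimal(lex):
    try:
        return Decimal(lex)
    except InvalidOperation:
        return None

cast = [c for c in (cast_decimal(lex) for lex, _ in terms)
        if c is not None]
cast_avg = sum(cast) / len(cast)    # 13.5 / 5 = 2.7 on the modified data

print(lo, hi, avg, cast_avg)
```

Run on the five-rating dataset, this reproduces the "03"/4/2.5e0 row and the 2.7 average from the cast query.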
> As you can see from the AVG, maybe it should :-) Either way, the
> RDF term has been changed from xsd:integer to xsd:decimal, but the
> value is the same.
>
> On the modified data:
>
> -------------------------------------------------------
> | min | max              | avg                        |
> =======================================================
> | 1.0 | "4"^^xsd:decimal | 2.700000000000000000000000 |
> -------------------------------------------------------
>
> So casting does do something useful on MIN.

Yep, that looks like a useful, and unsurprising, result.

> ARQ silently ignores things that don't evaluate:
>
> <thing> rev:hasReview [ rev:rating "good" ] .

Again, I think that's what I'd expect/hope to happen.

>> [1] would probably require explicit string, date, and numerical
>> aggregate functions, e.g. STRMIN(), DATEMIN(), which is maybe not a
>> great idea.
>
> If we want [1] to work on the modified data, I think we need that
> sort of thing, but I'm not sure whether it is better to just have a
> plain MIN/1 and live with the odd effects.

Agreed.

> Some operator that differed from casting by not changing the
> datatype of a number - giving 1.0 and 4 (not "4"^^xsd:decimal) - and
> skipping strings rather than turning them into numbers would put it
> all in one place and avoid a variant for every aggregator.
>
> valueCompatible(xsd:decimal, ?x) => ?x or an error

Seems excessive. I think the [2] case is fine for this type of use
case, though maybe you have some other case in mind that makes this
more appealing?

>> Also, we might want the histogram: [inventing COUNT(), SAMPLE() and
>> FLOOR()]
>
> fn:floor - not an aggregate?

Yes, not an aggregate function.

>> SELECT xsd:integer(FLOOR(SAMPLE(?rating))), COUNT(?rating)
>> WHERE {
>>   <thing> rev:hasReview [ rev:rating ?rating ] .
>> } GROUP BY FLOOR(?rating)
>>
>> where I would expect:
>>
>> xsd:integer(FLOOR(SAMPLE(?rating)))  COUNT(?rating)
>> 1                                    1
>> 3                                    1
>> 4                                    1
>> 2                                    1
>>
>> In the histogram case I don't see any way round the cast/floor
>> function, as I need to be explicit about what I want done, but
>> that's fine I think.
>>
>>>> In the first case it's as you say: either it's not present, or
>>>> it's a known datatype; in the second case you get all kinds of
>>>> values.
>>>
>>> Let's get concrete:
>>>
>>> Does option 5 [1] have initial consensus? Use the relationship as
>>> given by "ORDER BY", which includes "<".
>>
>> 5 seems sensible, depending on what's done with unbound values. The
>> obvious thing seems to be that they're not carried over into
>> aggregates, but that should be explicit, as they can be sorted on.
>
> I think that an aggregate that uses variables will have to define
> its effect on missing values itself; it will also have to make a
> decision about a group collection with nothing in it (it has to
> anyway: SELECT AGG(?x) { FILTER(false) })
>
> For all the examples we have so far that use a variable, skipping
> unbounds makes sense. (What if it's an expression and the expression
> does not evaluate? Skip as well?)

On the face of it, skipping non-evaluating expressions seems
sensible, and in keeping with the rest of SPARQL.

> Count(*) counts rows, even if the row is empty. Count(?x) and
> count(*) can be different.

Good point. I hadn't considered the empty result case. There's also
the question of whether/how we support the AGGREGATE(DISTINCT expr)
form as well.

> SELECT count(*) {} ==> 1
> SELECT count(?x) { ?s ?p ?o OPTIONAL { ?p ?q ?x . FILTER (false) } }
>   ==> 0

Seems logical.

- Steve

-- 
Steve Harris
Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK
+44(0)20 8973 2465  http://www.garlik.com/
Registered in England and Wales 535 7233  VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
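[Editorial note: the count(*) / count(?x) distinction quoted above can be sketched with a toy solution-sequence model. This is my own illustration, not a SPARQL engine: solutions are dicts from variable name to value, the empty pattern {} yields one empty solution, and an OPTIONAL whose FILTER is false leaves ?x unbound.]

```python
# Toy model of SPARQL solution counting (an illustration, not an
# engine): count(*) counts rows, even empty ones; count(?x) counts
# only rows where ?x is bound.

def count_star(rows):
    """count(*): every solution row counts, including the empty row."""
    return len(rows)

def count_var(rows, var):
    """count(?var): only rows where the variable is bound count."""
    return sum(1 for row in rows if var in row)

# SELECT count(*) {} - the empty pattern yields one empty solution:
empty_pattern = [{}]
c_star = count_star(empty_pattern)              # 1

# ?s ?p ?o matches (one hypothetical triple here), but the
# OPTIONAL { ... FILTER(false) } never binds ?x:
optional_unbound = [{"s": ":s1", "p": ":p1", "o": ":o1"}]
c_x = count_var(optional_unbound, "x")          # 0

print(c_star, c_x)
```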
Received on Thursday, 3 September 2009 10:47:12 UTC