- From: Steve Harris <steve.harris@garlik.com>
- Date: Wed, 2 Sep 2009 13:18:21 +0100
- To: "Seaborne, Andy" <andy.seaborne@hp.com>
- Cc: "public-rdf-dawg@w3.org Group" <public-rdf-dawg@w3.org>
On 1 Sep 2009, at 18:08, Seaborne, Andy wrote: >>> My idea of a common case is when all the rows grouped have the same >>> type or are undefined. Is that your common case? >> >> No. Our apps have two cases, one where we control the data (or at >> least the conversion of it) and one where take in arbitrary data from >> outside. I imagine that some people have apps with some combination >> of >> these two, but currently we don't. > > Could you make that a concrete example, with the answers that might > make sense to you? (test case mode = on) Good idea. Lets imagine distributed product reviews from different sources: <thing> rev:hasReview [ rev:rating 1.0 ] . <thing> rev:hasReview [ rev:rating 4 ] . <thing> rev:hasReview [ rev:rating "3" ] . <thing> rev:hasReview [ rev:rating "2.5"^^xsd:double ] . We can get all the ratings using SPARQL/Query 1.0: SELECT ?rating WHERE { <thing> rev:hasReview [ rev:rating ?rating ] . } But, we might want to get the range of scores: [inventing MIN(), MAX() and MEAN()] SELECT MIN(?rating), MAX(?rating), MEAN(?rating) WHERE { <thing> rev:hasReview [ rev:rating ?rating ] . } Here, I might hope to get [1] MIN(?rating) MAX(?rating) MEAN(?rating) 1.0 "4" 2.625 But I can see arguments for requiring an explicit "cast" to xsd:decimal or something, like MEAN(?rating, xsd:decimal) or MEAN(xsd:decimal(?rating)) and then getting: [2] MIN(?rating) MAX(?rating) MEAN(?rating) 1.0 4.0 2.625 [1] would probably require explicit string, date, and numerical aggregate functions, eg STRMIN(), DATEMIN(), which is maybe not a great idea. Also, we might want the histogram: [inventing COUNT(), SAMPLE() and FLOOR()] SELECT xsd:integer(FLOOR(SAMPLE(?rating))), COUNT(?rating) WHERE { <thing> rev:hasReview [ rev:rating ?rating ] . } GROUP BY FLOOR(?rating) Where I would expect xsd:integer(FLOOR(SAMPLE(?rating))) COUNT(?rating) 1 1 3 1 4 1 2 1 In the histogram case I don't see any way round the cast/floor function, as I need to be explicit about what I want done, but that's fine I think. >> In the first case it's as you say, either it's not present, or it's a >> known datatype, in the second case you get all kinds of values. > > Let's get concrete: > > Does option 5 [1] have initial consensus? Use the relationship as > given by "ORDER BY" which includes "<". 5 seems sensible, depending one what's done with unbound values. The obvious thing seems to be that they're not carried over into aggregates, but that should be explicit as they can be sorted on. > This gives a spec-determined answer for many cases but not all. > Local implementations may give a total ordering using "ORDER BY" but > it's not to be relied on across implementations (I could not do that > in ARQ because the ordering changes as you add datatypes understood > by "<") Yes, that's an issue. None of my system have the ability to dynamically add datatypes. > With this: > > SELECT MIN(?x) { .. } # No GROUP BY > is the same as: > SELECT ?x { .. } ORDER BY ?x LIMIT 1 > > where ORDER BY is determined. That seems like it has low user surprise in it's favour. > Across a mix of type of RDF node, you may not get the type your > application wants but that is possible if you use MIN(some > expression). > > What cases do people see where this is inconvenient? - Steve -- Steve Harris Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK +44(0)20 8973 2465 http://www.garlik.com/ Registered in England and Wales 535 7233 VAT # 849 0517 11 Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
Received on Wednesday, 2 September 2009 12:18:58 UTC