RE: Semantics of aggregates from Seaborne, Andy on 2009-09-03 (public-rdf-dawg@w3.org from July to September 2009)

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Thu, 3 Sep 2009 10:32:11 +0000
To: Steve Harris <steve.harris@garlik.com>
CC: "public-rdf-dawg@w3.org Group" <public-rdf-dawg@w3.org>
Message-ID: <B6CF1054FDC8B845BF93A6645D19BEA3693CF63A98@GVW1118EXC.americas.hpqcorp.net>


> -----Original Message-----
> From: Steve Harris [mailto:steve.harris@garlik.com]
> Sent: 02 September 2009 13:18
> To: Seaborne, Andy
> Cc: public-rdf-dawg@w3.org Group
> Subject: Re: Semantics of aggregates
> 
> On 1 Sep 2009, at 18:08, Seaborne, Andy wrote:
> 
> >>> My idea of a common case is when all the rows grouped have the same
> >>> type or are undefined.  Is that your common case?
> >>
> >> No. Our apps have two cases, one where we control the data (or at
> >> least the conversion of it) and one where take in arbitrary data from
> >> outside. I imagine that some people have apps with some combination
> >> of
> >> these two, but currently we don't.
> >
> > Could you make that a concrete example, with the answers that might
> > make sense to you? (test case mode = on)
> 
> Good idea.
> 
> Lets imagine distributed product reviews from different sources:
> 
> <thing> rev:hasReview [ rev:rating 1.0 ] .
> <thing> rev:hasReview [ rev:rating 4 ] .
> <thing> rev:hasReview [ rev:rating "3" ] .
> <thing> rev:hasReview [ rev:rating "2.5"^^xsd:double ] .

Can I add one case?

When I tried this myself, I noticed that these happen to give a order when done lexically that the same as the value.

<thing> rev:hasReview [ rev:rating "03" ] .

("10" comes after 1.0 lexically so that didn't illustrate the point)

See attached.

> 
> We can get all the ratings using SPARQL/Query 1.0:
> 
> SELECT ?rating
> WHERE {
>    <thing> rev:hasReview [ rev:rating ?rating ] .
> }
> 
> 
> But, we might want to get the range of scores: [inventing MIN(), MAX()
> and MEAN()]
> 
> SELECT MIN(?rating), MAX(?rating), MEAN(?rating)
> WHERE {
>    <thing> rev:hasReview [ rev:rating ?rating ] .
> }
> 
> Here, I might hope to get
> 
> [1]  MIN(?rating)  MAX(?rating)  MEAN(?rating)
>       1.0           "4"           2.625

ARQ gets much the same:

---------------------
| min | max | avg   |
=====================
| 1.0 | 4   | 2.5e0 |
---------------------

Because avg ignores the "3" and only considers numeric values.  Would that be OK to you?

With the modified data, I get:

----------------------
| min  | max | avg   |
======================
| "03" | 4   | 2.5e0 |
----------------------

because ORDER BY happened to uses lexical form first to get a total ordering of terms that can't be compared by "<". But that isn't the only choice - another one would have to been to sort on the different value spaces (and each unknown datatype is separate "value space") - that would have put the string at front or back so it would come out in MIN() or MAX() for any string.

Without changing ORDER BY to be more defined, we either live with this oddity for a design including MIN/1 or return multiple choices.

> But I can see arguments for requiring an explicit "cast" to
> xsd:decimal or something, like MEAN(?rating, xsd:decimal) or
> MEAN(xsd:decimal(?rating)) and then getting:
> 
> [2] MIN(?rating)  MAX(?rating)  MEAN(?rating)
>      1.0           4.0           2.625

Then I get much the same for MIN(xsd:decimal(....)) etc:

-------------------------------------------------------
| min | max              | avg                        |
=======================================================
| 1.0 | "4"^^xsd:decimal | 2.625000000000000000000000 |
-------------------------------------------------------

It’s not 4.0 because casting didn't force to canonical form but it is the same value.  As you can see from the AVG , maybe it should :-)  Either way, the RDF term has been changed from xsd:integer to xsd:decimal but the value is the same.

On the modified data:

-------------------------------------------------------
| min | max              | avg                        |
=======================================================
| 1.0 | "4"^^xsd:decimal | 2.700000000000000000000000 |
-------------------------------------------------------

So casting does do something useful on MIN.

ARQ ignores silently things that don't evaluates:

<thing> rev:hasReview [ rev:rating "good" ] .

> [1] would probably require explicit string, date, and numerical
> aggregate functions, eg STRMIN(), DATEMIN(), which is maybe not a
> great idea.

If we want [1] to work on the modified data, I think we need that sort of thing but I'm not sure whether it is better to just have a plain MIN/1 and live with the odd effects.  Some operator that differed from casting by not changing the datatype of a number to give 1.0 and 4 (not 4^decimal) and skip strings rather than turn them into numbers would put it all in one place and not have a variant for every aggregator.

valueCompatible(xsd:decimal, ?x) => ?x or an error

> Also, we might want the histogram: [inventing COUNT(), SAMPLE() and
> FLOOR()]
> 

fn:floor - not an aggregate?

> SELECT xsd:integer(FLOOR(SAMPLE(?rating))), COUNT(?rating)
> WHERE {
>    <thing> rev:hasReview [ rev:rating ?rating ] .
> } GROUP BY FLOOR(?rating)
> 
> Where I would expect
> 
> xsd:integer(FLOOR(SAMPLE(?rating)))  COUNT(?rating)
> 1                                    1
> 3                                    1
> 4                                    1
> 2                                    1
> 
> In the histogram case I don't see any way round the cast/floor
> function, as I need to be explicit about what I want done, but that's
> fine I think.
> 
> >> In the first case it's as you say, either it's not present, or it's a
> >> known datatype, in the second case you get all kinds of values.
> >
> > Let's get concrete:
> >
> > Does option 5 [1] have initial consensus? Use the relationship as
> > given by "ORDER BY" which includes "<".
> 
> 5 seems sensible, depending one what's done with unbound values. The
> obvious thing seems to be that they're not carried over into
> aggregates, but that should be explicit as they can be sorted on.

I think that an aggregate that uses variables will have to define it's effect on missing values itself; it will have to have a decision about a group collection with nothing in it as well (it has to anyway:  SELECT AGG(?x) { FILTER(false) })

For all the examples so far, that we have that use a variable, skipping unbounds make sense.  (What if it's an expression and the expression does not evaluate? Skip as well?) 

Count(*) counts rows, even if the row is empty.  Count(?x) and count(*) can be different.

SELECT count(*) {} ==> 1
SELECT count(?x) { ?s ?p ?o OPTIONAL { ?p ?q ?x . FILTER (false) } } ==> 0

 Andy

> 
> > This gives a spec-determined answer for many cases but not all.
> > Local implementations may give a total ordering using "ORDER BY" but
> > it's not to be relied on across implementations (I could not do that
> > in ARQ because the ordering changes as you add datatypes understood
> > by "<")
> 
> Yes, that's an issue. None of my system have the ability to
> dynamically add datatypes.
> 
> > With this:
> >
> > SELECT MIN(?x) { .. } # No GROUP BY
> > is the same as:
> > SELECT ?x { .. } ORDER BY ?x LIMIT 1
> >
> > where ORDER BY is determined.
> 
> That seems like it has low user surprise in it's favour.
> 
> > Across a mix of type of RDF node, you may not get the type your
> > application wants but that is possible if you use MIN(some
> > expression).
> >
> > What cases do people see where this is inconvenient?
> 
> 
> - Steve
> 
> --
> Steve Harris
> Garlik Limited, 2 Sheen Road, Richmond, TW9 1AE, UK
> +44(0)20 8973 2465  http://www.garlik.com/

> Registered in England and Wales 535 7233 VAT # 849 0517 11
> Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10
> 9AD
Attachments

application/octet-stream attachment: D.ttl
Received on Thursday, 3 September 2009 10:32:37 UTC