From: Steve Harris <steve.harris@garlik.com>

Date: Sun, 7 Mar 2010 21:33:49 +0000

Cc: SPARQL Working Group <public-rdf-dawg@w3.org>

Message-Id: <F0F0295C-938A-4C39-9148-2486D73C61CF@garlik.com>

To: Andy Seaborne <andy.seaborne@talis.com>

Date: Sun, 7 Mar 2010 21:33:49 +0000

Cc: SPARQL Working Group <public-rdf-dawg@w3.org>

Message-Id: <F0F0295C-938A-4C39-9148-2486D73C61CF@garlik.com>

To: Andy Seaborne <andy.seaborne@talis.com>

On 7 Mar 2010, at 17:42, Andy Seaborne wrote: > ISSUE-53 > > I propose the following to define ExprMultiSet: > > ------- > > Let Ω be a partition. > > ExprMultiSet(Ω) = > { eval(expr,μ) | μ in Ω such that eval(μ(expr)) is defined } > UNION > { e | μ in Ω such that eval(μ(expr)) is undefined } > > where "e" is some symbol that is distinct from all RDF terms. > > card[x]: > if DISTINCT: > card[x] = 1 if there exists μ in Ω such that x = eval(μ(expr)) > card[x] = 0 otherwise > else > card[x] = count of μ in Ω such that x = eval(μ(expr)) I find the reuse of the term ExprMultiset as a function very confusing, but I think I understand the proposal. The current draft is not as clear as it should be but: AGGREGATE(ExprMultiset) on Ω results in Aggregation(GroupClause, ExprMultiset, AGGREGATE, Ω) So, by my understanding the end result of this proposal is: Aggregation(GroupClause, ExprMultiset, func, Ω) = { merge(k, func(S) | (k, Ω') in Partition(GroupClause, Ω) } where S = { eval(exp,μ') | exp in ExprMultiset, μ' in Ω' such that eval(exp,μ') is defined } UNION { e | exp in ExprMultiset, μ' in Ω' such that eval(exp,μ') is undefined } But perhaps I've missed the point? > -------- > > "e" just records error evaluations. > > This is the most flexible definition. An alternative is > > ExprMultiset(Ω) = > { eval(expr,μ) | μ in Ω such that eval(expr,μ) is defined } > > which is hard-coding dropping errors and unbounds during evaluation. > But the aggregate can't know there were some errors. Right. Do we have a usecase where this is important? I don't remember offhand whether SQL passes NULLs to aggregates, other than COUNT(*), but I think it doesn't. > Another possibility is that a yes/no flag indicating a error was > seen. But this might as well be the count of errors, which is > equivalent to the flexible definition given. Yes, somewhat. It complicates the definition of many of the aggregates to some degree, but that's not a huge burden. > By the way, this is in no way a recipe for implementation. > Aggregation can be done over all groups in parallel during query > execution. > > > > For the last publication, it was noted > > http://lists.w3.org/Archives/Public/public-rdf-dawg/2009OctDec/0646.html > > Unbound and error are the same. The current design so far has it > that any error means that the multiset is invalid and that group is > not considered. Right, this would tie us to a particular definition of COUNT(*), where unbounds and errors are both counted. I don't have any reason to prefer one definition over another. > We didn't have time to propose a solid design to address ISSUE-53 - > the potential design at the time of publication was that any error > when calculating the ExprMultiset from a partition meant that > > SUM of {1, 2, unbound} is an error. > COUNT of {1, 2, unbound} is an error. > > I don't think that is a useful form for COUNT(?x). It does seem to > mean that COUNT(?x) is either COUNT(*) or error; it can't be > anything else. This is assuming that we don't take something like your second definition, I think. > COUNT(?x) can not be zero because zero arises when there are no ?x > but there are solutions in the partition. If there are no solutions > in the partition then there is no group key and no grouping happens. > > For each aggregate we can decide what happens about unbounds and > errors. > > I would like to see: > > COUNT(*) = size of multiset. > COUNT(DISTINCT *) = size of set after removing any e (i.e. skip > undefs). I find the punning of * (or DISTINCT) here a bit unnatural. > COUNT(?x) = number of times ?x is defined in each group > 0 <= COUNT(?x) <= COUNT(*) > > COUNT(DISTINCT ?x) = number of times ?x is uniquely defined in each > group > > I'm less worried about SUM(?x) but I'd prefer that > > SUM(?x) = op:numeric-add of defined values of ?x, skips unbounds > > rather that the rigid form we currently have. > > Previously, one of the difficulties raised for this design was that > the operation to add two numbers wasn't op:numeric-add because that > could not cope the errors (there were related datatyping issues as > well). > > With the definition of ExprMultiSet above, op:numeric-add can be > used to define SUM. There is step between getting the ExprMultiSet > and the calculation of aggregation. This step, for SUM (and COUNT(? > x)), removes any errors. > > GROUP_CONCAT(?x) = concatenation > and now GROUP_CONCAT of an empty set can be defined as "". > > ------------- > Some examples: > > Does anyone want to suggest we design to get different results in > any of these cases? > > > --Data: > > @prefix : <http://example/> . > > :x1 a :T . > :x1 :p 1 . > :x1 :p 2 . > > :x2 a :T . > :x2 :p 9 . > > :x3 a :T . > :x3 :p 5 . > :x3 :q "x" . > > :x4 a :T . > :x4 :q "z". > > > -- > > > -- Query 1: > 1 PREFIX : <http://example/> > 2 > 3 SELECT ?x (count(*) AS ?C) > 4 WHERE > 5 { ?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> :T > 6 OPTIONAL > 7 { ?x :p ?v} > 8 } > 9 GROUP BY ?x > 10 ORDER BY str(?x) > > ----------- > | x | C | > =========== > | :x1 | 2 | > | :x2 | 1 | > | :x3 | 1 | > | :x4 | 1 | > ----------- > > -- Query 2: > > Change line 3 to: > SELECT ?x (count(?v) AS ?C) > > ----------- > | x | C | > =========== > | :x1 | 2 | > | :x2 | 1 | > | :x3 | 1 | > | :x4 | 0 | > ----------- > > -- Query 3: > > Change line 3 to: > SELECT ?x (sum(?v) AS ?C) > > ----------- > | x | C | > =========== > | :x1 | 3 | > | :x2 | 9 | > | :x3 | 5 | > | :x4 | 0 | > ----------- > > The :x4 row is zero because there were no valid numbers to add > together. Arguably SUM({}) is an error, c.f. MIN({}). I can live with 0 though. I think the above all match what I would expect, but... > -- Different query OPTIONAL part - now has ?p > > 1 PREFIX : <http://example/> > 2 > 3 SELECT ?x (sum(?v) AS ?C) > 4 WHERE > 5 { ?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> :T > 6 OPTIONAL > 7 { ?x ?any ?v} > 8 } > 9 GROUP BY ?x > 10 ORDER BY str(?x) > > ----------- > | x | C | > =========== > | :x1 | 3 | > | :x2 | 9 | > | :x3 | 5 | > | :x4 | 0 | > ----------- > > The case where ?v is "Z2 and "x" have been skipped. For this one I would expect: ----------- | x | C | =========== | :x1 | 3 | | :x2 | 9 | ----------- I would expect the 3,9,5,0 result from SELECT ?x (sum(xsd:decimal(?v)) AS ?C) or, more explicitly SELECT ?x (sum(COALESCE(xsd:decimal(?v), 0)) AS ?C) But, I can see an argument that RDF data has a tendency to be scruffy, so maybe users would expect this? However, it seems dangerous/ misleading. I certainly want some way to know that I've tried to sum a string and an integer. SELECT ?x (SUM(?v) AS ?C) expands to: :x1 SUM({1, 2}) :x2 SUM({9}) :x3 SUM({5, "x"}) :x4 SUM({"z"}) SELECT ?x (SUM(xsd:decimal(?v)) AS ?C) expands to :x1 SUM({1.0, 2.0}) :x2 SUM({9.0}) :x3 SUM({5.0, e}) :x4 SUM({e}) Or same same without the e's if the second form of Aggregation is used. - Steve -- Steve Harris, Garlik Limited 2 Sheen Road, Richmond, TW9 1AE, UK +44 20 8973 2465 http://www.garlik.com/ Registered in England and Wales 535 7233 VAT # 849 0517 11 Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9ADReceived on Sunday, 7 March 2010 21:34:18 GMT

*
This archive was generated by hypermail 2.3.1
: Tuesday, 26 March 2013 16:15:41 GMT
*