- From: Andy Seaborne <andy.seaborne@talis.com>
- Date: Sun, 07 Mar 2010 22:57:06 +0000
- To: Steve Harris <steve.harris@garlik.com>
- CC: SPARQL Working Group <public-rdf-dawg@w3.org>
Overall - we seem to have the start of a possible design and so this message is about details.

On 07/03/2010 9:33 PM, Steve Harris wrote:
> On 7 Mar 2010, at 17:42, Andy Seaborne wrote:
>
>> ISSUE-53
>>
>> I propose the following to define ExprMultiSet:
>>
>> -------
>>
>> Let Ω be a partition.
>>
>> ExprMultiSet(Ω) =
>>    { eval(expr,μ) | μ in Ω such that eval(μ(expr)) is defined }
>>    UNION
>>    { e | μ in Ω such that eval(μ(expr)) is undefined }
>>
>> where "e" is some symbol that is distinct from all RDF terms.
>>
>> card[x]:
>>    if DISTINCT:
>>       card[x] = 1 if there exists μ in Ω such that x = eval(μ(expr))
>>       card[x] = 0 otherwise
>>    else
>>       card[x] = count of μ in Ω such that x = eval(μ(expr))
>
> I find the reuse of the term ExprMultiset as a function very confusing,
> but I think I understand the proposal.

It's just trying to write the ExprMultiset based on Ω, for which there is no notation. I suppose it should involve μ. It's only about whether you like to write definitions with free terms or not.

"ExprMultiset based on Ω, expr = ... "

> The current draft is not as clear as it should be but:
> AGGREGATE(ExprMultiset) on Ω results in Aggregation(GroupClause,
> ExprMultiset, AGGREGATE, Ω)
>
> So, by my understanding the end result of this proposal is:
>
> Aggregation(GroupClause, ExprMultiset, func, Ω) =
>    { merge(k, func(S)) | (k, Ω') in Partition(GroupClause, Ω) }
>
> where
> S = { eval(exp,μ') | exp in ExprMultiset, μ' in Ω' such that
>       eval(exp,μ') is defined }
>     UNION
>     { e | exp in ExprMultiset, μ' in Ω' such that eval(exp,μ') is undefined }
>
> But perhaps I've missed the point?

Don't think so.

Minor: The merge needs to involve a variable for func(S) but that's a separate issue and wasn't in the published version either. func(S) is a value, not a binding.

Minor: I wrote "eval(μ(expr))" when I remember to make the change from eval(expr,μ) because you'd used it earlier.

>> --------
>>
>> "e" just records error evaluations.
>>
>> This is the most flexible definition. An alternative is
>>
>> ExprMultiset(Ω) =
>>    { eval(expr,μ) | μ in Ω such that eval(expr,μ) is defined }
>>
>> which is hard-coding dropping errors and unbounds during evaluation.
>> But the aggregate can't know there were some errors.
>
> Right. Do we have a use case where this is important? I don't remember
> offhand whether SQL passes NULLs to aggregates, other than COUNT(*), but
> I think it doesn't.

It gives an account of COUNT(*) because it is the size of the MultiSet
   = number of rows
   = number of values + number of errors

SQL can treat nulls and errors differently in a way we can't.

>> Another possibility is that a yes/no flag indicating an error was seen.
>> But this might as well be the count of errors, which is equivalent to
>> the flexible definition given.
>
> Yes, somewhat. It complicates the definition of many of the aggregates
> to some degree, but that's not a huge burden.
>
>> By the way, this is in no way a recipe for implementation. Aggregation
>> can be done over all groups in parallel during query execution.
>>
>>
>>
>> For the last publication, it was noted
>>
>> http://lists.w3.org/Archives/Public/public-rdf-dawg/2009OctDec/0646.html
>>
>> Unbound and error are the same. The current design so far has it that
>> any error means that the multiset is invalid and that group is not
>> considered.
>
> Right, this would tie us to a particular definition of COUNT(*), where
> unbounds and errors are both counted. I don't have any reason to prefer
> one definition over another.

Sorry - don't understand. COUNT(*) is the number of rows in the partition, no change there. That can be explained in the main defns as number of values + number of errors because that is the total size of the multiset.

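To make the "number of values + number of errors" accounting concrete, here is a minimal Python sketch of the flexible definition. It is an illustration only and not part of the proposal: the names `ERR`, `expr_multiset` and `card` are hypothetical, solutions are modelled as plain dicts, and an unbound variable is assumed to surface as a `KeyError`.

```python
from collections import Counter

# Hypothetical stand-in for "e": a symbol distinct from every RDF term.
ERR = object()


def expr_multiset(partition, expr):
    """Evaluate expr over each solution mu, recording failures as ERR."""
    values = []
    for mu in partition:
        try:
            values.append(expr(mu))
        except (KeyError, TypeError, ValueError):
            # Assumption: unbound variables and evaluation errors look alike here.
            values.append(ERR)
    return values


def card(values, distinct=False):
    """Cardinality of each element, collapsing to 1 under DISTINCT."""
    counts = Counter(values)
    if distinct:
        return {x: 1 for x in counts}
    return dict(counts)


# A partition of four solutions; the last one leaves ?x unbound.
partition = [{"x": 1}, {"x": 2}, {"x": 2}, {}]
values = expr_multiset(partition, lambda mu: mu["x"])

n_rows = len(values)                             # what COUNT(*) reports
n_errors = sum(1 for v in values if v is ERR)
n_values = n_rows - n_errors
```

Because every solution contributes exactly one element (a value or the marker), the size of the multiset always equals the number of rows in the partition, whatever each individual aggregate later decides to do with the markers.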
>> We didn't have time to propose a solid design to address ISSUE-53 -
>> the potential design at the time of publication was that any error
>> when calculating the ExprMultiset from a partition meant that
>>
>> SUM of {1, 2, unbound} is an error.
>> COUNT of {1, 2, unbound} is an error.
>>
>> I don't think that is a useful form for COUNT(?x). It does seem to
>> mean that COUNT(?x) is either COUNT(*) or error; it can't be anything
>> else.
>
> This is assuming that we don't take something like your second
> definition, I think.

Yes - and the main flexible definition works. I'm referring to the version hinted at in the last publication at this point. I hadn't realised that COUNT(?x) didn't work in the way it was going at the time of publication.

>> COUNT(?x) can not be zero because zero arises when there are no ?x but
>> there are solutions in the partition. If there are no solutions in the
>> partition then there is no group key and no grouping happens.
>>
>> For each aggregate we can decide what happens about unbounds and errors.
>>
>> I would like to see:
>>
>> COUNT(*) = size of multiset.
>> COUNT(DISTINCT *) = size of set after removing any e (i.e. skip undefs).
>
> I find the punning of * (or DISTINCT) here a bit unnatural.

I'm confused. I wasn't meaning any punning. I'm just using the SPARQL syntax as proposed.

... COUNT examples ...

>>
>> -- Query 3:
>>
>> Change line 3 to:
>> SELECT ?x (sum(?v) AS ?C)
>>
>> -----------
>> | x   | C |
>> ===========
>> | :x1 | 3 |
>> | :x2 | 9 |
>> | :x3 | 5 |
>> | :x4 | 0 |
>> -----------
>>
>> The :x4 row is zero because there were no valid numbers to add together.
>
> Arguably SUM({}) is an error, c.f. MIN({}). I can live with 0 though.

Happy with error as well, especially if an error causes unbound and not removal of the row with that key.

>
> I think the above all match what I would expect, but...
>
>> -- Different query OPTIONAL part - now has ?p
>>
>>  1  PREFIX : <http://example/>
>>  2
>>  3  SELECT ?x (sum(?v) AS ?C)
>>  4  WHERE
>>  5    { ?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> :T
>>  6      OPTIONAL
>>  7        { ?x ?any ?v}
>>  8    }
>>  9  GROUP BY ?x
>> 10  ORDER BY str(?x)
>>
>> -----------
>> | x   | C |
>> ===========
>> | :x1 | 3 |
>> | :x2 | 9 |
>> | :x3 | 5 |
>> | :x4 | 0 |
>> -----------
>>
>> The cases where ?v is "z" and "x" have been skipped.
>
> For this one I would expect:
>
> -----------
> | x   | C |
> ===========
> | :x1 | 3 |
> | :x2 | 9 |
> -----------

You're right - there are errors in the eval of fn:numeric-add. We need to define what happens if agg(..) is an error.

An alternative is,

-----------
| x   | C |
===========
| :x1 | 3 |
| :x2 | 9 |
| :x3 |   |
| :x4 |   |
-----------

which retains the group row (to distinguish from no key). On reflection, that's my preferred design.

> I would expect the 3,9,5,0 result from
> SELECT ?x (sum(xsd:decimal(?v)) AS ?C)
> or, more explicitly
> SELECT ?x (sum(COALESCE(xsd:decimal(?v), 0)) AS ?C)

This results in 3.0, 9.0, 5.0 (decimals) and 0^^xsd:integer.

> But, I can see an argument that RDF data has a tendency to be scruffy,

:-)

> so maybe users would expect this? However, it seems dangerous/misleading.

Agreed - let's record this as a decision to be made separately from the overall aggregate design.

>
> I certainly want some way to know that I've tried to sum a string and an
> integer.
>
> SELECT ?x (SUM(?v) AS ?C) expands to:
> :x1 SUM({1, 2})
> :x2 SUM({9})
> :x3 SUM({5, "x"})
> :x4 SUM({"z"})
>
> SELECT ?x (SUM(xsd:decimal(?v)) AS ?C) expands to
> :x1 SUM({1.0, 2.0})
> :x2 SUM({9.0})
> :x3 SUM({5.0, e})
> :x4 SUM({e})
>
> Or the same without the e's if the second form of Aggregation is used.
>
> - Steve
>
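
The two SUM behaviours debated above can be compared on the groups Steve expands. The Python sketch below is only an illustration under stated assumptions: `sum_strict` and `sum_skipping` are hypothetical names, `None` stands in for an unbound ?C, and a string operand is taken to be the only source of a numeric-add error.

```python
def is_number(v):
    """True for plain numbers; bool is excluded although it subclasses int."""
    return isinstance(v, (int, float)) and not isinstance(v, bool)


def sum_strict(values):
    """Any non-numeric operand is an error: ?C is left unbound (None)."""
    if not all(is_number(v) for v in values):
        return None
    return sum(values)


def sum_skipping(values):
    """Non-numeric operands are silently dropped; an empty sum is 0."""
    return sum(v for v in values if is_number(v))


# The per-group multisets from the expansion of SELECT ?x (SUM(?v) AS ?C).
groups = {":x1": [1, 2], ":x2": [9], ":x3": [5, "x"], ":x4": ["z"]}

# Both policies keep one row per group key; only the value of ?C differs.
strict = {key: sum_strict(vs) for key, vs in groups.items()}
skipping = {key: sum_skipping(vs) for key, vs in groups.items()}

for key in groups:
    print(key, strict[key], skipping[key])
```

With the strict policy the :x3 and :x4 rows are retained with ?C unbound, which matches the alternative table above; the skipping policy reproduces the 3, 9, 5, 0 column.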
Received on Sunday, 7 March 2010 22:57:43 UTC