- From: Andy Seaborne <andy.seaborne@talis.com>
- Date: Sun, 07 Mar 2010 17:42:08 +0000
- To: SPARQL Working Group <public-rdf-dawg@w3.org>
ISSUE-53
I propose the following to define ExprMultiSet:
-------
Let Ω be a partition.
ExprMultiSet(Ω) =
    { eval(expr,μ) | μ in Ω such that eval(expr,μ) is defined }
  UNION
    { e | μ in Ω such that eval(expr,μ) is undefined }
where "e" is some symbol that is distinct from all RDF terms.
card[x]:
  if DISTINCT:
    card[x] = 1 if there exists μ in Ω such that x = eval(expr,μ)
    card[x] = 0 otherwise
  else
    card[x] = count of μ in Ω such that x = eval(expr,μ)
--------
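To make the definition above concrete, here is a minimal Python sketch. It is an illustration only, not a recipe for implementation. The names ERR, expr_multiset and get_v are made up for this example, solutions are modelled as plain dicts, and an undefined evaluation is modelled as a raised exception. Giving "e" a cardinality of 1 under DISTINCT is one reading of card[x], not something the definition states explicitly.

```python
from collections import Counter

ERR = object()  # hypothetical stand-in for "e": distinct from every RDF term


def expr_multiset(partition, expr, distinct=False):
    """Build ExprMultiSet(partition) as a Counter mapping value -> card[value]."""
    cards = Counter()
    for mu in partition:
        try:
            value = expr(mu)   # eval(expr, mu)
        except Exception:      # assumption: undefined (unbound or error) raises
            value = ERR
        cards[value] += 1
    if distinct:
        # DISTINCT: every value that occurs at all gets cardinality 1
        cards = Counter(dict.fromkeys(cards, 1))
    return cards


# ?v is bound to 1 in two solutions and unbound in the third.
partition = [{"v": 1}, {"v": 1}, {}]
get_v = lambda mu: mu["v"]   # raises KeyError when ?v is unbound

ms = expr_multiset(partition, get_v)
ms_distinct = expr_multiset(partition, get_v, distinct=True)
```

Here ms records the value 1 twice and ERR once, while ms_distinct records each of them once.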
"e" just records error evaluations.
This is the most flexible definition. An alternative is
ExprMultiSet(Ω) =
{ eval(expr,μ) | μ in Ω such that eval(expr,μ) is defined }
which hard-codes dropping errors and unbounds during evaluation. But
then the aggregate can't know that there were any errors.
Another possibility is a yes/no flag indicating that an error was seen.
But this might as well be the count of errors, which is equivalent to
the flexible definition given.
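The relationship between these alternatives can be sketched in Python. This is a hedged illustration: ERR, flexible and dropping are hypothetical names, and None models an undefined evaluation. It shows that the error flag, the error count and the error-dropping multiset can all be derived from the flexible form, whereas the dropping form has already lost that information.

```python
ERR = object()  # hypothetical stand-in for the symbol "e"


def flexible(values):
    """Keep one ERR marker per undefined evaluation (None models undefined)."""
    return [ERR if v is None else v for v in values]


def dropping(values):
    """The alternative: undefined evaluations vanish during evaluation."""
    return [v for v in values if v is not None]


evaluated = [1, 2, None, None]   # eval(expr, mu) for four solutions

flex = flexible(evaluated)
error_count = sum(1 for v in flex if v is ERR)     # count of errors
error_flag = error_count > 0                       # the yes/no flag
defined_only = [v for v in flex if v is not ERR]   # same as dropping(evaluated)
```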
By the way, this is in no way a recipe for implementation. Aggregation
can be done over all groups in parallel during query execution.
For the last publication, it was noted
http://lists.w3.org/Archives/Public/public-rdf-dawg/2009OctDec/0646.html
Unbound and error are the same. The current design so far has it that
any error means that the multiset is invalid and that group is not
considered.
We didn't have time to propose a solid design to address ISSUE-53 - the
potential design at the time of publication was that any error when
calculating the ExprMultiSet from a partition made the aggregate an
error, so that:
SUM of {1, 2, unbound} is an error.
COUNT of {1, 2, unbound} is an error.
I don't think that is a useful form for COUNT(?x). It does seem to mean
that COUNT(?x) is either COUNT(*) or error; it can't be anything else.
COUNT(?x) cannot be zero because zero arises only when there are solutions
in the partition but none of them binds ?x, and under the current design
that case is an error. If there are no solutions in the partition then
there is no group key and no grouping happens.
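The "COUNT(*) or error" behaviour of the rigid design can be sketched as follows. This is an assumed model, not text from the published draft: AggregateError, rigid_count and outcome are hypothetical names, and None stands for an unbound or erroring evaluation of ?x.

```python
class AggregateError(Exception):
    """Hypothetical marker for 'the aggregate is an error'."""


def rigid_count(values):
    # Any undefined evaluation invalidates the whole multiset.
    if any(v is None for v in values):
        raise AggregateError("error in ExprMultiSet")
    return len(values)


def outcome(values):
    try:
        return rigid_count(values)
    except AggregateError:
        return "error"


print(outcome([1, 2]))         # 2, the same as COUNT(*)
print(outcome([1, 2, None]))   # error
print(outcome([None]))         # error, never 0
```

A partition always holds at least one solution, so the only way to see no defined ?x is the last case, which is an error rather than zero.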
For each aggregate we can decide what happens about unbounds and errors.
I would like to see:
COUNT(*) = size of multiset.
COUNT(DISTINCT *) = size of set after removing any e (i.e. skip undefs).
COUNT(?x) = number of times ?x is defined in each group
0 <= COUNT(?x) <= COUNT(*)
COUNT(DISTINCT ?x) = number of times ?x is uniquely defined in each group
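A small Python sketch of the COUNT forms listed above, over one group's multiset of evaluations. ERR and the function names are hypothetical. COUNT(DISTINCT *) is left out because it depends on how whole solutions are compared, which is not modelled here.

```python
ERR = object()  # hypothetical stand-in for the symbol "e"


def count_star(group):
    """COUNT(*): size of the multiset, error markers included."""
    return len(group)


def count_expr(group):
    """COUNT(?x): number of times ?x is defined in the group."""
    return sum(1 for v in group if v is not ERR)


def count_distinct_expr(group):
    """COUNT(DISTINCT ?x): number of distinct defined values of ?x."""
    return len({v for v in group if v is not ERR})


group = [1, 1, 2, ERR]      # ?x evaluated over four solutions
only_unbound = [ERR]        # a group in which ?x is never defined

for g in (group, only_unbound):
    # The bound proposed above: 0 <= COUNT(?x) <= COUNT(*)
    assert 0 <= count_expr(g) <= count_star(g)
```

With this reading, COUNT(?x) over only_unbound is 0 while COUNT(*) is 1.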
I'm less worried about SUM(?x) but I'd prefer that
SUM(?x) = op:numeric-add of defined values of ?x, skips unbounds
rather than the rigid form we currently have.
Previously, one of the difficulties raised for this design was that the
operation to add two numbers wasn't op:numeric-add because that could
not cope with the errors (there were related datatyping issues as well).
With the definition of ExprMultiSet above, op:numeric-add can be used to
define SUM. There is a step between getting the ExprMultiSet and the
calculation of aggregation. This step, for SUM (and COUNT(?x)), removes
any errors.
GROUP_CONCAT(?x) = concatenation
and now GROUP_CONCAT of an empty set can be defined as "".
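The two-step shape described above (first remove errors, then aggregate) might look like this in Python. It is a sketch under assumptions: ERR, drop_errors, agg_sum and agg_group_concat are hypothetical names, operator.add stands in for op:numeric-add, the separator parameter is an addition for illustration, and applying the error-removal step to GROUP_CONCAT is an assumption rather than something stated above.

```python
from functools import reduce
import operator

ERR = object()  # hypothetical stand-in for the symbol "e"


def drop_errors(multiset):
    """The intermediate step: remove every error marker."""
    return [v for v in multiset if v is not ERR]


def agg_sum(multiset):
    # Fold the defined values with addition, starting from 0.
    return reduce(operator.add, drop_errors(multiset), 0)


def agg_group_concat(multiset, separator=""):
    # Concatenate the defined values; an empty input gives "".
    return separator.join(str(v) for v in drop_errors(multiset))


print(agg_sum([1, 2, ERR]))               # 3
print(agg_sum([ERR]))                     # 0
print(repr(agg_group_concat([])))         # ''
```

Starting the fold from 0 is what makes a group with no valid numbers sum to zero.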
-------------
Some examples:
Does anyone want to suggest a design that gives different results in any
of these cases?
--Data:
@prefix : <http://example/> .
:x1 a :T .
:x1 :p 1 .
:x1 :p 2 .
:x2 a :T .
:x2 :p 9 .
:x3 a :T .
:x3 :p 5 .
:x3 :q "x" .
:x4 a :T .
:x4 :q "z".
--
-- Query 1:
1 PREFIX : <http://example/>
2
3 SELECT ?x (count(*) AS ?C)
4 WHERE
5 { ?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> :T
6 OPTIONAL
7 { ?x :p ?v}
8 }
9 GROUP BY ?x
10 ORDER BY str(?x)
-----------
| x | C |
===========
| :x1 | 2 |
| :x2 | 1 |
| :x3 | 1 |
| :x4 | 1 |
-----------
-- Query 2:
Change line 3 to:
SELECT ?x (count(?v) AS ?C)
-----------
| x | C |
===========
| :x1 | 2 |
| :x2 | 1 |
| :x3 | 1 |
| :x4 | 0 |
-----------
-- Query 3:
Change line 3 to:
SELECT ?x (sum(?v) AS ?C)
-----------
| x | C |
===========
| :x1 | 3 |
| :x2 | 9 |
| :x3 | 5 |
| :x4 | 0 |
-----------
The :x4 row is zero because there were no valid numbers to add together.
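The three result tables can be reproduced with a small Python model of the data and the query pattern. This is a sketch, not a SPARQL engine: triples are plain tuples, the OPTIONAL is hand-coded as a left join, None models an unbound ?v, and all function and variable names are made up for the example.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"
triples = [
    (":x1", RDF_TYPE, ":T"), (":x1", ":p", 1), (":x1", ":p", 2),
    (":x2", RDF_TYPE, ":T"), (":x2", ":p", 9),
    (":x3", RDF_TYPE, ":T"), (":x3", ":p", 5), (":x3", ":q", "x"),
    (":x4", RDF_TYPE, ":T"), (":x4", ":q", "z"),
]


def solutions(triples):
    """{ ?x a :T OPTIONAL { ?x :p ?v } } as (x, v) pairs; v is None if unbound."""
    out = []
    for s, p, o in triples:
        if p == RDF_TYPE and o == ":T":
            vs = [o2 for s2, p2, o2 in triples if s2 == s and p2 == ":p"]
            out.extend((s, v) for v in (vs or [None]))
    return out


groups = defaultdict(list)   # GROUP BY ?x
for x, v in solutions(triples):
    groups[x].append(v)

count_star = {x: len(vs) for x, vs in groups.items()}                        # Query 1
count_v = {x: sum(v is not None for v in vs) for x, vs in groups.items()}    # Query 2
sum_v = {x: sum(v for v in vs if v is not None) for x, vs in groups.items()} # Query 3
```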
-- Different query: the OPTIONAL part now has ?any
1 PREFIX : <http://example/>
2
3 SELECT ?x (sum(?v) AS ?C)
4 WHERE
5 { ?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> :T
6 OPTIONAL
7 { ?x ?any ?v}
8 }
9 GROUP BY ?x
10 ORDER BY str(?x)
-----------
| x | C |
===========
| :x1 | 3 |
| :x2 | 9 |
| :x3 | 5 |
| :x4 | 0 |
-----------
The cases where ?v is "z" and "x" have been skipped.
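The same kind of Python model reproduces this last table. It assumes that any non-numeric value of ?v is treated like an error and skipped by SUM. The names are hypothetical, and in this sketch the IRI :T bound through the rdf:type triple is skipped in the same way as the strings.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"
triples = [
    (":x1", RDF_TYPE, ":T"), (":x1", ":p", 1), (":x1", ":p", 2),
    (":x2", RDF_TYPE, ":T"), (":x2", ":p", 9),
    (":x3", RDF_TYPE, ":T"), (":x3", ":p", 5), (":x3", ":q", "x"),
    (":x4", RDF_TYPE, ":T"), (":x4", ":q", "z"),
]

groups = defaultdict(list)   # GROUP BY ?x
for s, p, o in triples:
    if p == RDF_TYPE and o == ":T":
        # OPTIONAL { ?x ?any ?v }: every object attached to ?x
        groups[s].extend(o2 for s2, _, o2 in triples if s2 == s)


def is_number(v):
    """Assumption: only integers count as numeric in this toy data."""
    return isinstance(v, int) and not isinstance(v, bool)


sum_v = {x: sum(v for v in vs if is_number(v)) for x, vs in groups.items()}
```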
Andy
Received on Sunday, 7 March 2010 17:42:43 UTC