Proposed definition of ExprMultiSet from Andy Seaborne on 2010-03-07 (public-rdf-dawg@w3.org from January to March 2010)

From: Andy Seaborne <andy.seaborne@talis.com>
Date: Sun, 07 Mar 2010 17:42:08 +0000
To: SPARQL Working Group <public-rdf-dawg@w3.org>
Message-ID: <4B93E570.80205@talis.com>
ISSUE-53

I propose the following to define ExprMultiSet:

-------

Let Ω be a partition.

ExprMultiSet(Ω) =
   { eval(expr,μ) | μ in Ω such that eval(μ(expr)) is defined }
   UNION
   { e | μ in Ω such that  eval(μ(expr)) is undefined }

where "e" is some symbol that is distinct from all RDF terms.

card[x]:
   if DISTINCT:
      card[x] = 1 if there exists μ in Ω such that x =  eval(μ(expr))
      card[x] = 0 otherwise
   else
      card[x] = count of μ in Ω such that x =  eval(μ(expr))

--------
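The definition above can be sketched in Python as follows. This is only an illustration under stated assumptions, not a recipe for implementation: solutions are modelled as dicts, the expression as a callable that raises when it cannot be evaluated, and `ERR` and `expr_multiset` are hypothetical names. Treating the error marker as having cardinality 1 under DISTINCT is one reading of the card[x] rule.

```python
from collections import Counter

ERR = object()  # stand-in for "e": a marker distinct from every RDF term


def expr_multiset(partition, expr, distinct=False):
    """Return card[x] for each x in ExprMultiSet(partition) as a Counter."""
    card = Counter()
    for mu in partition:
        try:
            value = expr(mu)   # eval(expr, mu)
        except Exception:      # unbound variable or evaluation error
            value = ERR
        card[value] += 1
    if distinct:
        # DISTINCT: cardinality 1 for everything that occurs at least once
        # (assumption: this includes the error marker)
        card = Counter(dict.fromkeys(card, 1))
    return card


# One group with three solutions; ?v is unbound in the last one.
group = [{"v": 1}, {"v": 1}, {}]
cards = expr_multiset(group, lambda mu: mu["v"])
distinct_cards = expr_multiset(group, lambda mu: mu["v"], distinct=True)
```

Here `cards` records the value 1 twice and the error marker once, while `distinct_cards` records each of them once.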

"e" just records error evaluations.

This is the most flexible definition. An alternative is

ExprMultiSet(Ω) =
   { eval(expr,μ) | μ in Ω such that eval(expr,μ) is defined }

which hard-codes the dropping of errors and unbounds during evaluation. 
But then the aggregate can't know that there were any errors.

Another possibility is a yes/no flag indicating that an error was seen. 
But this might as well be the count of errors, which is equivalent to 
the flexible definition given.
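A small sketch of why the flexible form subsumes the other two, assuming the multiset is held as a Python `Counter` and `ERR` is a hypothetical marker for the error symbol: both the error-free multiset and the yes/no flag can be derived from it.

```python
from collections import Counter

ERR = object()  # hypothetical marker standing in for the error symbol

# card[x] for one group: value 1 twice, value 2 once, two error evaluations
cards = Counter({1: 2, 2: 1, ERR: 2})

# Alternative definition: the same multiset with the errors dropped
without_errors = Counter({x: n for x, n in cards.items() if x is not ERR})

# Error count and yes/no flag, both derived from the flexible form
error_count = cards[ERR]
error_seen = error_count > 0
```

The reverse direction does not work: from `without_errors` alone there is no way to recover `error_count`.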

By the way, this is in no way a recipe for implementation.  Aggregation 
can be done over all groups in parallel during query execution.



For the last publication, it was noted

http://lists.w3.org/Archives/Public/public-rdf-dawg/2009OctDec/0646.html

Unbound and error are the same. The current design so far has it that 
any error means that the multiset is invalid and that group is not 
considered.

We didn't have time to propose a solid design to address ISSUE-53 - the 
potential design at the time of publication was that any error when 
calculating the ExprMultiSet from a partition meant that

SUM of {1, 2, unbound} is an error.
COUNT of {1, 2, unbound} is an error.

I don't think that is a useful form for COUNT(?x).  It does seem to mean 
that COUNT(?x) is either COUNT(*) or error; it can't be anything else.

Under that design COUNT(?x) cannot be zero, because zero arises only when 
there are no ?x but there are solutions in the partition.  If there are no 
solutions in the partition then there is no group key and no grouping happens.
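To make the consequence concrete, here is a sketch of COUNT(?x) under the design where any unbound or error invalidates the group. The function name `strict_count` and the use of `None` to represent "error" are assumptions made for illustration.

```python
def strict_count(partition, var):
    """COUNT(?var) when any unbound value makes the whole group invalid."""
    if any(var not in mu for mu in partition):
        return None            # stands for "error": the group is not considered
    return len(partition)      # every solution binds var, so this is COUNT(*)


all_bound = strict_count([{"v": 1}, {"v": 2}], "v")
one_unbound = strict_count([{"v": 1}, {"v": 2}, {}], "v")
```

The only possible outcomes are the size of the partition or an error. Since a partition always holds at least one solution, zero is never returned.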

For each aggregate we can decide what happens about unbounds and errors.

I would like to see:

COUNT(*) = size of multiset.
COUNT(DISTINCT *) = size of set after removing any e (i.e. skip undefs).

COUNT(?x) = number of times ?x is defined in each group
     0 <= COUNT(?x) <= COUNT(*)

COUNT(DISTINCT ?x) = number of distinct defined values of ?x in each group

I'm less worried about SUM(?x) but I'd prefer that

   SUM(?x) = op:numeric-add of defined values of ?x, skips unbounds

rather than the rigid form we currently have.

Previously, one of the difficulties raised for this design was that the 
operation to add two numbers wasn't op:numeric-add because that could 
not cope with the errors (there were related datatyping issues as well).

With the definition of ExprMultiSet above, op:numeric-add can be used to 
define SUM.  There is a step between getting the ExprMultiSet and the 
calculation of the aggregate.  This step, for SUM (and COUNT(?x)), removes 
any errors.

GROUP_CONCAT(?x) = concatenation
and now GROUP_CONCAT of an empty set can be defined as "".
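The aggregates described above can be sketched as a filter step followed by the calculation. This is an illustration only: the function names, the `ERR` marker, Python's `+` standing in for op:numeric-add, and the space separator for GROUP_CONCAT are all assumptions, not part of the proposal.

```python
ERR = object()  # hypothetical marker for an error or unbound evaluation


def defined(values):
    """The intermediate step: remove error evaluations from the multiset."""
    return [v for v in values if v is not ERR]


def count_star(values):
    return len(values)                    # size of the multiset


def count(values):
    return len(defined(values))           # 0 <= COUNT(?x) <= COUNT(*)


def count_distinct(values):
    return len(set(defined(values)))


def sum_(values):
    total = 0
    for v in defined(values):
        total = total + v                 # stands in for op:numeric-add
    return total


def group_concat(values, sep=" "):
    return sep.join(str(v) for v in defined(values))


values = [1, 2, 2, ERR]
```

With these definitions `sum_(values)` skips the error rather than failing, and `group_concat([ERR])` is the empty string.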

-------------
Some examples:

Does anyone want to suggest a design that gives different results in any 
of these cases?


--Data:

@prefix : <http://example/> .

:x1 a :T .
:x1 :p 1 .
:x1 :p 2 .

:x2 a :T .
:x2 :p 9 .

:x3 a :T .
:x3 :p 5 .
:x3 :q "x" .

:x4 a :T .
:x4 :q "z".


-- 


-- Query 1:
   1 PREFIX  :     <http://example/>
   2
   3 SELECT  ?x (count(*) AS ?C)
   4 WHERE
   5   { ?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> :T
   6     OPTIONAL
   7       { ?x :p ?v}
   8   }
   9 GROUP BY ?x
  10 ORDER BY str(?x)

-----------
| x   | C |
===========
| :x1 | 2 |
| :x2 | 1 |
| :x3 | 1 |
| :x4 | 1 |
-----------

-- Query 2:

Change line 3 to:
     SELECT  ?x (count(?v) AS ?C)

-----------
| x   | C |
===========
| :x1 | 2 |
| :x2 | 1 |
| :x3 | 1 |
| :x4 | 0 |
-----------

-- Query 3:

Change line 3 to:
     SELECT  ?x (sum(?v) AS ?C)

-----------
| x   | C |
===========
| :x1 | 3 |
| :x2 | 9 |
| :x3 | 5 |
| :x4 | 0 |
-----------

The :x4 row is zero because there were no valid numbers to add together.
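The three result tables can be reproduced with a short Python sketch. It hand-evaluates the pattern rather than running SPARQL; the triple encoding (plain strings for IRIs, `"type"` for rdf:type) and the names `TRIPLES` and `solutions_by_x` are assumptions made for illustration.

```python
# Data from the message, as (subject, predicate, object) triples.
TRIPLES = [
    ("x1", "type", "T"), ("x1", "p", 1), ("x1", "p", 2),
    ("x2", "type", "T"), ("x2", "p", 9),
    ("x3", "type", "T"), ("x3", "p", 5), ("x3", "q", "x"),
    ("x4", "type", "T"), ("x4", "q", "z"),
]


def solutions_by_x():
    """Solutions of { ?x a :T  OPTIONAL { ?x :p ?v } }, grouped by ?x."""
    groups = {}
    for s, p, o in TRIPLES:
        if (p, o) == ("type", "T"):
            matched = [{"x": s, "v": v}
                       for s2, p2, v in TRIPLES if s2 == s and p2 == "p"]
            # OPTIONAL keeps one solution with ?v unbound when nothing matches
            groups[s] = matched or [{"x": s}]
    return groups


groups = solutions_by_x()
# Query 1: count(*) is the number of solutions in each group
query1 = {x: len(g) for x, g in groups.items()}
# Query 2: count(?v) counts only the solutions where ?v is bound
query2 = {x: sum(1 for mu in g if "v" in mu) for x, g in groups.items()}
# Query 3: sum(?v) adds the bound values and skips unbounds
query3 = {x: sum(mu["v"] for mu in g if "v" in mu) for x, g in groups.items()}
```

The :x4 group contains a single solution with ?v unbound, which is why it counts as 1 for count(*) but 0 for count(?v) and sum(?v).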

-- Different query: the OPTIONAL part now has ?any

   1 PREFIX  :     <http://example/>
   2
   3 SELECT  ?x (sum(?v) AS ?C)
   4 WHERE
   5   { ?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> :T
   6     OPTIONAL
   7       { ?x ?any ?v}
   8   }
   9 GROUP BY ?x
  10 ORDER BY str(?x)

-----------
| x   | C |
===========
| :x1 | 3 |
| :x2 | 9 |
| :x3 | 5 |
| :x4 | 0 |
-----------

The cases where ?v is "z" and "x" have been skipped.
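A sketch of this last query in Python, again hand-evaluating the pattern. The triple encoding and the helper names `is_number` and `sum_by_x` are assumptions for illustration; any value that is not a number is treated as an error evaluation for SUM and skipped.

```python
# Data from the message, as (subject, predicate, object) triples.
TRIPLES = [
    ("x1", "type", "T"), ("x1", "p", 1), ("x1", "p", 2),
    ("x2", "type", "T"), ("x2", "p", 9),
    ("x3", "type", "T"), ("x3", "p", 5), ("x3", "q", "x"),
    ("x4", "type", "T"), ("x4", "q", "z"),
]


def is_number(term):
    """True for numeric literals only (IRIs and strings are str here)."""
    return isinstance(term, (int, float)) and not isinstance(term, bool)


def sum_by_x():
    """sum(?v) over { ?x a :T  OPTIONAL { ?x ?any ?v } }, grouped by ?x."""
    totals = {}
    for s, p, o in TRIPLES:
        if (p, o) == ("type", "T"):
            # ?x ?any ?v matches every triple whose subject is ?x
            values = [v for s2, _, v in TRIPLES if s2 == s]
            # Non-numeric values are skipped instead of failing the group
            totals[s] = sum(v for v in values if is_number(v))
    return totals


totals = sum_by_x()
```

Every group now also binds ?v to :T through the rdf:type triple, and that value is skipped in the same way as the strings.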

 Andy
Received on Sunday, 7 March 2010 17:42:43 UTC