Re: Proposed definition of ExprMultiSet from Steve Harris on 2010-03-07 (public-rdf-dawg@w3.org from January to March 2010)

From: Steve Harris <steve.harris@garlik.com>
Date: Sun, 7 Mar 2010 21:33:49 +0000
To: Andy Seaborne <andy.seaborne@talis.com>
Cc: SPARQL Working Group <public-rdf-dawg@w3.org>
Message-Id: <F0F0295C-938A-4C39-9148-2486D73C61CF@garlik.com>
On 7 Mar 2010, at 17:42, Andy Seaborne wrote:

> ISSUE-53
>
> I propose the following to define ExprMultiSet:
>
> -------
>
> Let Ω be a partition.
>
> ExprMultiSet(Ω) =
>  { eval(expr,μ) | μ in Ω such that eval(μ(expr)) is defined }
>  UNION
>  { e | μ in Ω such that  eval(μ(expr)) is undefined }
>
> where "e" is some symbol that is distinct from all RDF terms.
>
> card[x]:
>  if DISTINCT:
>     card[x] = 1 if there exists μ in Ω such that x =  eval(μ(expr))
>     card[x] = 0 otherwise
>  else
>     card[x] = count of μ in Ω such that x =  eval(μ(expr))

I find the reuse of the term ExprMultiset as a function very  
confusing, but I think I understand the proposal.

The current draft is not as clear as it should be but:
AGGREGATE(ExprMultiset) on Ω results in Aggregation(GroupClause,  
ExprMultiset, AGGREGATE, Ω)

So, by my understanding the end result of this proposal is:

Aggregation(GroupClause, ExprMultiset, func, Ω) =
    { merge(k, func(S) | (k, Ω') in Partition(GroupClause, Ω) }

where
S = { eval(exp,μ') | exp in ExprMultiset, μ' in Ω' such that  
eval(exp,μ') is defined }
     UNION
     { e | exp in ExprMultiset, μ' in Ω' such that eval(exp,μ') is  
undefined }

But perhaps I've missed the point?

> --------
>
> "e" just records error evaluations.
>
> This is the most flexible definition. An alternative is
>
> ExprMultiset(Ω) =
>  { eval(expr,μ) | μ in Ω such that eval(expr,μ) is defined }
>
> which is hard-coding dropping errors and unbounds during evaluation.  
> But the aggregate can't know there were some errors.

Right. Do we have a usecase where this is important? I don't remember  
offhand whether SQL passes NULLs to aggregates, other than COUNT(*),  
but I think it doesn't.

> Another possibility is that a yes/no flag indicating a error was  
> seen. But this might as well be the count of errors, which is  
> equivalent to the flexible definition given.

Yes, somewhat. It complicates the definition of many of the aggregates  
to some degree, but that's not a huge burden.

> By the way, this is in no way a recipe for implementation.   
> Aggregation can be done over all groups in parallel during query  
> execution.
>
>
>
> For the last publication, it was noted
>
> http://lists.w3.org/Archives/Public/public-rdf-dawg/2009OctDec/0646.html
>
> Unbound and error are the same. The current design so far has it  
> that any error means that the multiset is invalid and that group is  
> not considered.

Right, this would tie us to a particular definition of COUNT(*), where  
unbounds and errors are both counted. I don't have any reason to  
prefer one definition over another.

> We didn't have time to propose a solid design to address ISSUE-53 -  
> the potential design at the time of publication was that any error  
> when calculating the ExprMultiset from a partition meant that
>
> SUM of {1, 2, unbound} is an error.
> COUNT of {1, 2, unbound} is an error.
>
> I don't think that is a useful form for COUNT(?x).  It does seem to  
> mean that COUNT(?x) is either COUNT(*) or error; it can't be  
> anything else.

This is assuming that we don't take something like your second  
definition, I think.

> COUNT(?x) can not be zero because zero arises when there are no ?x  
> but there are solutions in the partition.  If there are no solutions  
> in the partition then there is no group key and no grouping happens.
>
> For each aggregate we can decide what happens about unbounds and  
> errors.
>
> I would like to see:
>
> COUNT(*) = size of multiset.
> COUNT(DISTINCT *) = size of set after removing any e (i.e. skip  
> undefs).

I find the punning of * (or DISTINCT) here a bit unnatural.

> COUNT(?x) = number of times ?x is defined in each group
>    0 <= COUNT(?x) <= COUNT(*)
>
> COUNT(DISTINCT ?x) = number of times ?x is uniquely defined in each  
> group
>
> I'm less worried about SUM(?x) but I'd prefer that
>
>  SUM(?x) = op:numeric-add of defined values of ?x, skips unbounds
>
> rather that the rigid form we currently have.
>
> Previously, one of the difficulties raised for this design was that  
> the operation to add two numbers wasn't op:numeric-add because that  
> could not cope the errors (there were related datatyping issues as  
> well).
>
> With the definition of ExprMultiSet above, op:numeric-add can be  
> used to define SUM.  There is step between getting the ExprMultiSet  
> and the calculation of aggregation.  This step, for SUM (and COUNT(? 
> x)), removes any errors.
>
> GROUP_CONCAT(?x) = concatenation
> and now GROUP_CONCAT of an empty set can be defined as "".
>
> -------------
> Some examples:
>
> Does anyone want to suggest we design to get different results in  
> any of these cases?
>
>
> --Data:
>
> @prefix : <http://example/> .
>
> :x1 a :T .
> :x1 :p 1 .
> :x1 :p 2 .
>
> :x2 a :T .
> :x2 :p 9 .
>
> :x3 a :T .
> :x3 :p 5 .
> :x3 :q "x" .
>
> :x4 a :T .
> :x4 :q "z".
>
>
> -- 
>
>
> -- Query 1:
>  1 PREFIX  :     <http://example/>
>  2
>  3 SELECT  ?x (count(*) AS ?C)
>  4 WHERE
>  5   { ?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> :T
>  6     OPTIONAL
>  7       { ?x :p ?v}
>  8   }
>  9 GROUP BY ?x
> 10 ORDER BY str(?x)
>
> -----------
> | x   | C |
> ===========
> | :x1 | 2 |
> | :x2 | 1 |
> | :x3 | 1 |
> | :x4 | 1 |
> -----------
>
> -- Query 2:
>
> Change line 3 to:
>    SELECT  ?x (count(?v) AS ?C)
>
> -----------
> | x   | C |
> ===========
> | :x1 | 2 |
> | :x2 | 1 |
> | :x3 | 1 |
> | :x4 | 0 |
> -----------
>
> -- Query 3:
>
> Change line 3 to:
>    SELECT  ?x (sum(?v) AS ?C)
>
> -----------
> | x   | C |
> ===========
> | :x1 | 3 |
> | :x2 | 9 |
> | :x3 | 5 |
> | :x4 | 0 |
> -----------
>
> The :x4 row is zero because there were no valid numbers to add  
> together.

Arguably SUM({}) is an error, c.f. MIN({}). I can live with 0 though.

I think the above all match what I would expect, but...

> -- Different query OPTIONAL part - now has ?p
>
>  1 PREFIX  :     <http://example/>
>  2
>  3 SELECT  ?x (sum(?v) AS ?C)
>  4 WHERE
>  5   { ?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> :T
>  6     OPTIONAL
>  7       { ?x ?any ?v}
>  8   }
>  9 GROUP BY ?x
> 10 ORDER BY str(?x)
>
> -----------
> | x   | C |
> ===========
> | :x1 | 3 |
> | :x2 | 9 |
> | :x3 | 5 |
> | :x4 | 0 |
> -----------
>
> The case  where ?v is "Z2 and "x" have been skipped.

For this one I would expect:

-----------
| x   | C |
===========
| :x1 | 3 |
| :x2 | 9 |
-----------

I would expect the 3,9,5,0 result from
    SELECT ?x (sum(xsd:decimal(?v)) AS ?C)
or, more explicitly
    SELECT ?x (sum(COALESCE(xsd:decimal(?v), 0)) AS ?C)
But, I can see an argument that RDF data has a tendency to be scruffy,  
so maybe users would expect this? However, it seems dangerous/ 
misleading.

I certainly want some way to know that I've tried to sum a string and  
an integer.

SELECT ?x (SUM(?v) AS ?C) expands to:
:x1  SUM({1, 2})
:x2  SUM({9})
:x3  SUM({5, "x"})
:x4  SUM({"z"})

SELECT ?x (SUM(xsd:decimal(?v)) AS ?C) expands to
:x1  SUM({1.0, 2.0})
:x2  SUM({9.0})
:x3  SUM({5.0, e})
:x4  SUM({e})

Or same same without the e's if the second form of Aggregation is used.

- Steve

-- 
Steve Harris, Garlik Limited
2 Sheen Road, Richmond, TW9 1AE, UK
+44 20 8973 2465  http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10  
9AD
Received on Sunday, 7 March 2010 21:34:18 UTC