Re: Proposed definition of ExprMultiSet from Andy Seaborne on 2010-03-07 (public-rdf-dawg@w3.org from January to March 2010)

From: Andy Seaborne <andy.seaborne@talis.com>
Date: Sun, 07 Mar 2010 22:57:06 +0000
To: Steve Harris <steve.harris@garlik.com>
CC: SPARQL Working Group <public-rdf-dawg@w3.org>
Message-ID: <4B942F42.4090705@talis.com>
Overall - we seem to have the start of a possible design and so this 
message is about details.

On 07/03/2010 9:33 PM, Steve Harris wrote:
> On 7 Mar 2010, at 17:42, Andy Seaborne wrote:
>
>> ISSUE-53
>>
>> I propose the following to define ExprMultiSet:
>>
>> -------
>>
>> Let Ω be a partition.
>>
>> ExprMultiSet(Ω) =
>> { eval(expr,μ) | μ in Ω such that eval(μ(expr)) is defined }
>> UNION
>> { e | μ in Ω such that eval(μ(expr)) is undefined }
>>
>> where "e" is some symbol that is distinct from all RDF terms.
>>
>> card[x]:
>> if DISTINCT:
>> card[x] = 1 if there exists μ in Ω such that x = eval(μ(expr))
>> card[x] = 0 otherwise
>> else
>> card[x] = count of μ in Ω such that x = eval(μ(expr))
>
> I find the reuse of the term ExprMultiset as a function very confusing,
> but I think I understand the proposal.

It's just trying to write the ExprMultiset based on Ω for which there is 
no notation.  I suppose it should involve μ.  It's only about whether you 
like to write definitions with free terms or not.
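
For concreteness, here is a minimal Python sketch of the proposed 
definition.  It is illustrative only and not a recipe for implementation. 
Assumptions made for the sketch: solutions μ are dicts, expr is a Python 
callable that raises when evaluation is undefined, and ERR stands in for 
the symbol "e".  None of these names come from the draft.

```python
from collections import Counter

ERR = object()  # stands in for "e": a marker distinct from every RDF term


def expr_multiset(expr, omega, distinct=False):
    """Return a mapping from each value (or ERR) to its cardinality card[x]."""
    card = Counter()
    for mu in omega:
        try:
            x = expr(mu)      # eval(expr, mu) is defined
        except Exception:
            x = ERR           # eval(expr, mu) is undefined (error or unbound)
        card[x] += 1
    if distinct:
        # DISTINCT: every element that occurs at all has cardinality 1
        card = Counter(dict.fromkeys(card, 1))
    return card


# Hypothetical partition: the last solution leaves ?v unbound
omega = [{"v": 1}, {"v": 2}, {"v": 2}, {}]
m = expr_multiset(lambda mu: mu["v"], omega)
```

Here m[2] is 2, m[ERR] is 1, and the cardinalities add up to the number 
of solutions in the partition.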

"ExprMultiset based on Ω, expr = ... "

> The current draft is not as clear as it should be but:
> AGGREGATE(ExprMultiset) on Ω results in Aggregation(GroupClause,
> ExprMultiset, AGGREGATE, Ω)
>
> So, by my understanding the end result of this proposal is:
>
> Aggregation(GroupClause, ExprMultiset, func, Ω) =
> { merge(k, func(S)) | (k, Ω') in Partition(GroupClause, Ω) }
>
> where
> S = { eval(exp,μ') | exp in ExprMultiset, μ' in Ω' such that
> eval(exp,μ') is defined }
> UNION
> { e | exp in ExprMultiset, μ' in Ω' such that eval(exp,μ') is undefined }
>
> But perhaps I've missed the point?

Don't think so.

Minor: The merge needs to involve a variable for func(S) but that's a 
separate issue and wasn't in the published version either.  func(S) is a 
value, not a binding.

Minor: I wrote "eval(μ(expr))" where I remembered to make the change from 
eval(expr,μ) because you'd used it earlier.
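
To make the first point concrete, a rough Python sketch of Aggregation in 
which the merge binds func(S) to a variable.  The names (aggregation, 
group_var, out_var) are hypothetical, grouping is by a single variable 
only, and solutions where that variable is unbound are not handled.

```python
from collections import defaultdict

ERR = object()  # stands in for "e"


def aggregation(group_var, expr, func, out_var, omega):
    # Partition(GroupClause, omega): group solutions by the value of group_var
    partition = defaultdict(list)
    for mu in omega:
        partition[mu[group_var]].append(mu)

    results = []
    for key, omega_k in partition.items():
        s = []
        for mu in omega_k:
            try:
                s.append(expr(mu))   # eval(expr, mu) is defined
            except Exception:
                s.append(ERR)        # eval(expr, mu) is undefined
        # func(S) is a value, not a binding, so bind it to out_var
        # before merging it with the group key
        results.append({group_var: key, out_var: func(s)})
    return results


omega = [{"x": "x1", "v": 1}, {"x": "x1", "v": 2}, {"x": "x2"}]
rows = aggregation("x", lambda mu: mu["v"], len, "C", omega)
```

With len as the aggregate, each group row carries the size of its 
multiset, errors included.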

>> --------
>>
>> "e" just records error evaluations.
>>
>> This is the most flexible definition. An alternative is
>>
>> ExprMultiset(Ω) =
>> { eval(expr,μ) | μ in Ω such that eval(expr,μ) is defined }
>>
>> which is hard-coding dropping errors and unbounds during evaluation.
>> But the aggregate can't know there were some errors.
>
> Right. Do we have a usecase where this is important? I don't remember
> offhand whether SQL passes NULLs to aggregates, other than COUNT(*), but
> I think it doesn't.

It gives an account of COUNT(*) because it is the size of the MultiSet
   = number of rows
   = number of values + number of errors

SQL can treat nulls and errors differently in a way we can't.

>> Another possibility is that a yes/no flag indicating an error was seen.
>> But this might as well be the count of errors, which is equivalent to
>> the flexible definition given.
>
> Yes, somewhat. It complicates the definition of many of the aggregates
> to some degree, but that's not a huge burden.
>
>> By the way, this is in no way a recipe for implementation. Aggregation
>> can be done over all groups in parallel during query execution.
>>
>>
>>
>> For the last publication, it was noted
>>
>> http://lists.w3.org/Archives/Public/public-rdf-dawg/2009OctDec/0646.html
>>
>> Unbound and error are the same. The current design so far has it that
>> any error means that the multiset is invalid and that group is not
>> considered.
>
> Right, this would tie us to a particular definition of COUNT(*), where
> unbounds and errors are both counted. I don't have any reason to prefer
> one definition over another.

Sorry - don't understand.

COUNT(*) is the number of rows in the partition, no change there.

That can be explained in the main defns as number of values + number of 
errors  because that is the total size of the multiset.
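
A small Python illustration of that accounting, with the multiset as a 
list and ERR standing in for "e" (the data and names are made up for the 
example):

```python
ERR = object()  # stands in for "e"

multiset = [1, 2, ERR, 2, ERR]   # one entry per row in the partition

n_values = sum(1 for x in multiset if x is not ERR)
n_errors = sum(1 for x in multiset if x is ERR)

count_star = len(multiset)       # COUNT(*): number of rows
# COUNT(DISTINCT *) as I would like to see it: size of the set after removing e
count_distinct_star = len({x for x in multiset if x is not ERR})
```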

>> We didn't have time to propose a solid design to address ISSUE-53 -
>> the potential design at the time of publication was that any error
>> when calculating the ExprMultiset from a partition meant that
>>
>> SUM of {1, 2, unbound} is an error.
>> COUNT of {1, 2, unbound} is an error.
>>
>> I don't think that is a useful form for COUNT(?x). It does seem to
>> mean that COUNT(?x) is either COUNT(*) or error; it can't be anything
>> else.
>
> This is assuming that we don't take something like your second
> definition, I think.

Yes - and the main flexible definition works. I'm referring to the 
version hinted at in the last publication at this point.  I hadn't 
realised that COUNT(?x) didn't work in the way it was going at the time 
of publication.

>> COUNT(?x) can not be zero because zero arises when there are no ?x but
>> there are solutions in the partition. If there are no solutions in the
>> partition then there is no group key and no grouping happens.
>>
>> For each aggregate we can decide what happens about unbounds and errors.
>>
>> I would like to see:
>>
>> COUNT(*) = size of multiset.
>> COUNT(DISTINCT *) = size of set after removing any e (i.e. skip undefs).
>
> I find the punning of * (or DISTINCT) here a bit unnatural.

I'm confused. I wasn't meaning any punning.  I'm just using the SPARQL 
syntax as proposed.

... COUNT examples ...

>>
>> -- Query 3:
>>
>> Change line 3 to:
>> SELECT ?x (sum(?v) AS ?C)
>>
>> -----------
>> | x | C |
>> ===========
>> | :x1 | 3 |
>> | :x2 | 9 |
>> | :x3 | 5 |
>> | :x4 | 0 |
>> -----------
>>
>> The :x4 row is zero because there were no valid numbers to add together.
>
> Arguably SUM({}) is an error, c.f. MIN({}). I can live with 0 though.

Happy with error as well, especially if an error causes unbound and not 
removal of the row with that key.

>
> I think the above all match what I would expect, but...
>
>> -- Different query OPTIONAL part - now has ?p
>>
>> 1 PREFIX : <http://example/>
>> 2
>> 3 SELECT ?x (sum(?v) AS ?C)
>> 4 WHERE
>> 5 { ?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> :T
>> 6 OPTIONAL
>> 7 { ?x ?any ?v}
>> 8 }
>> 9 GROUP BY ?x
>> 10 ORDER BY str(?x)
>>
>> -----------
>> | x | C |
>> ===========
>> | :x1 | 3 |
>> | :x2 | 9 |
>> | :x3 | 5 |
>> | :x4 | 0 |
>> -----------
>>
>> The cases where ?v is "z" and "x" have been skipped.
>
> For this one I would expect:
>
> -----------
> | x | C |
> ===========
> | :x1 | 3 |
> | :x2 | 9 |
> -----------

You're right - there are errors in the eval of fn:numeric-add.

We need to define what happens if agg(..) is an error.

An alternative is,

-----------
| x | C |
===========
| :x1 | 3 |
| :x2 | 9 |
| :x3 |   |
| :x4 |   |
-----------

which retains the group row (to distinguish from no key).

On reflection, that's my preferred design.
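
A sketch of that behaviour in Python, using the group contents from your 
SUM(?v) expansion.  The helper agg_sum is hypothetical and only mimics 
fn:numeric-add failing on a non-numeric operand; an error leaves ?C 
unbound but the group row is still produced.

```python
def agg_sum(values):
    total = 0
    for v in values:
        if isinstance(v, bool) or not isinstance(v, (int, float)):
            # comparable to an error in the eval of fn:numeric-add
            raise TypeError(f"cannot add {v!r}")
        total += v
    return total


groups = {":x1": [1, 2], ":x2": [9], ":x3": [5, "x"], ":x4": ["z"]}

rows = []
for key, values in groups.items():
    row = {"x": key}
    try:
        row["C"] = agg_sum(values)
    except TypeError:
        pass  # agg(..) is an error: leave ?C unbound, keep the group row
    rows.append(row)
```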

> I would expect the 3,9,5,0 result from
> SELECT ?x (sum(xsd:decimal(?v)) AS ?C)
> or, more explicitly
> SELECT ?x (sum(COALESCE(xsd:decimal(?v), 0)) AS ?C)

This results in 3.0, 9.0, 5.0 (decimals) and 0^^xsd:integer.
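
The type mix can be imitated in Python, with Decimal standing in for 
xsd:decimal and int for xsd:integer.  This is only an analogy: the helper 
to_decimal_or_zero is hypothetical and the group contents are taken from 
your expansion further down.

```python
from decimal import Decimal, InvalidOperation


def to_decimal_or_zero(v):
    # COALESCE(xsd:decimal(?v), 0): fall back to integer 0 when the cast fails
    try:
        return Decimal(str(v))
    except InvalidOperation:
        return 0


groups = {":x1": [1, 2], ":x2": [9], ":x3": [5, "x"], ":x4": ["z"]}
sums = {k: sum(to_decimal_or_zero(v) for v in vs) for k, vs in groups.items()}
```

The :x4 group contains only the integer fallback, so its sum stays an 
integer, while the other three sums are decimals.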

> But, I can see an argument that RDF data has a tendency to be scruffy,

:-)

> so maybe users would expect this? However, it seems dangerous/misleading.

Agreed - let's record this as a decision to be made separately from the 
overall aggregate design.

>
> I certainly want some way to know that I've tried to sum a string and an
> integer.
>
> SELECT ?x (SUM(?v) AS ?C) expands to:
> :x1 SUM({1, 2})
> :x2 SUM({9})
> :x3 SUM({5, "x"})
> :x4 SUM({"z"})
>
> SELECT ?x (SUM(xsd:decimal(?v)) AS ?C) expands to
> :x1 SUM({1.0, 2.0})
> :x2 SUM({9.0})
> :x3 SUM({5.0, e})
> :x4 SUM({e})
>
> Or the same without the e's if the second form of Aggregation is used.
>
> - Steve
>
Received on Sunday, 7 March 2010 22:57:43 UTC