Re: Proposed definition of ExprMultiSet from Axel Polleres on 2010-03-26 (public-rdf-dawg@w3.org from January to March 2010)

From: Axel Polleres <axel.polleres@deri.org>
Date: Fri, 26 Mar 2010 10:18:32 +0000
To: Andy Seaborne <andy.seaborne@talis.com>, SPARQL Working Group <public-rdf-dawg@w3.org>
Cc: Axel Polleres <axel.polleres@deri.org>
Message-Id: <DED83A5C-6C5B-45C7-89F0-7BC86F7B890E@deri.org>
On 26 Mar 2010, at 09:57, Axel Polleres wrote:

> short clarification request:
> 
>> { eval(expr,μ) | μ in Ω such that eval(μ(expr)) is defined }
> 
> by "is defined" you mean "is unequal to 'error'", yes?

p.s.: 

 or do you mean such that μ(expr) is defined ?

I think I get the intention... to be able to treat unbound different from error, yes?


here my understanding of the proposal. Say we have: 

 ?X ?Y
 ----- 
 a  1
 b  0
 c 
 d  "bla"
 e  1

Here my understanding of the proposal:

 COUNT( * ) -> 5
 COUNT( ?X ) -> 4
 COUNT( DISTINCT ?X ) -> 3 
 
yes? if so, clear so far. now what about expressions?

 COUNT( ?X * ?X || ?X * ?X )  ->  ?
 COUNT( DISTINCT (?X * ?X || ?X * ?X) )  ->  ?

concretely, what happens to the "bla" row that produces an error? what happens to the unbound row, that also producse an error when the expression is evaluated?

Thanks,
Axel

> 
> What I mean to ask here... when I read the current section on Filter evaluation 
> http://www.w3.org/2009/sparql/docs/query-1.1/rq25.xml#evaluation
> to me it seems that eval is always "defined", but it could be an error for reasons of mistyping or
> values being unbound.
> 
> Thanks for clarification on whether/what I might have overlooked here!
> 
> Axel
> 
> On 7 Mar 2010, at 17:42, Andy Seaborne wrote:
> 
>> ISSUE-53
>> 
>> I propose the following to define ExprMultiSet:
>> 
>> -------
>> 
>> Let Ω be a partition.
>> 
>> ExprMultiSet(Ω) =
>>   { eval(expr,μ) | μ in Ω such that eval(μ(expr)) is defined }
>>   UNION
>>   { e | μ in Ω such that  eval(μ(expr)) is undefined }
>> 
>> where "e" is some symbol that is distinct from all RDF terms.
>> 
>> card[x]:
>>   if DISTINCT:
>>      card[x] = 1 if there exists μ in Ω such that x =  eval(μ(expr))
>>      card[x] = 0 otherwise
>>   else
>>      card[x] = count of μ in Ω such that x =  eval(μ(expr))
>> 
>> --------
>> 
>> "e" just records error evaluations.
>> 
>> This is the most flexible definition. An alternative is
>> 
>> ExprMultiset(Ω) =
>>   { eval(expr,μ) | μ in Ω such that eval(expr,μ) is defined }
>> 
>> which is hard-coding dropping errors and unbounds during evaluation. But
>> the aggregate can't know there were some errors.
>> 
>> Another possibility is that a yes/no flag indicating a error was seen.
>> But this might as well be the count of errors, which is equivalent to
>> the flexible definition given.
>> 
>> By the way, this is in no way a recipe for implementation.  Aggregation
>> can be done over all groups in parallel during query execution.
>> 
>> 
>> 
>> For the last publication, it was noted
>> 
>> http://lists.w3.org/Archives/Public/public-rdf-dawg/2009OctDec/0646.html
>> 
>> Unbound and error are the same. The current design so far has it that
>> any error means that the multiset is invalid and that group is not
>> considered.
>> 
>> We didn't have time to propose a solid design to address ISSUE-53 - the
>> potential design at the time of publication was that any error when
>> calculating the ExprMultiset from a partition meant that
>> 
>> SUM of {1, 2, unbound} is an error.
>> COUNT of {1, 2, unbound} is an error.
>> 
>> I don't think that is a useful form for COUNT(?x).  It does seem to mean
>> that COUNT(?x) is either COUNT(*) or error; it can't be anything else.
>> 
>> COUNT(?x) can not be zero because zero arises when there are no ?x but
>> there are solutions in the partition.  If there are no solutions in the
>> partition then there is no group key and no grouping happens.
>> 
>> For each aggregate we can decide what happens about unbounds and errors.
>> 
>> I would like to see:
>> 
>> COUNT(*) = size of multiset.
>> COUNT(DISTINCT *) = size of set after removing any e (i.e. skip undefs).
>> 
>> COUNT(?x) = number of times ?x is defined in each group
>>     0 <= COUNT(?x) <= COUNT(*)
>> 
>> COUNT(DISTINCT ?x) = number of times ?x is uniquely defined in each group
>> 
>> I'm less worried about SUM(?x) but I'd prefer that
>> 
>>   SUM(?x) = op:numeric-add of defined values of ?x, skips unbounds
>> 
>> rather that the rigid form we currently have.
>> 
>> Previously, one of the difficulties raised for this design was that the
>> operation to add two numbers wasn't op:numeric-add because that could
>> not cope the errors (there were related datatyping issues as well).
>> 
>> With the definition of ExprMultiSet above, op:numeric-add can be used to
>> define SUM.  There is step between getting the ExprMultiSet and the
>> calculation of aggregation.  This step, for SUM (and COUNT(?x)), removes
>> any errors.
>> 
>> GROUP_CONCAT(?x) = concatenation
>> and now GROUP_CONCAT of an empty set can be defined as "".
>> 
>> -------------
>> Some examples:
>> 
>> Does anyone want to suggest we design to get different results in any of
>> these cases?
>> 
>> 
>> --Data:
>> 
>> @prefix : <http://example/> .
>> 
>> :x1 a :T .
>> :x1 :p 1 .
>> :x1 :p 2 .
>> 
>> :x2 a :T .
>> :x2 :p 9 .
>> 
>> :x3 a :T .
>> :x3 :p 5 .
>> :x3 :q "x" .
>> 
>> :x4 a :T .
>> :x4 :q "z".
>> 
>> 
>> --
>> 
>> 
>> -- Query 1:
>>   1 PREFIX  :     <http://example/>
>>   2
>>   3 SELECT  ?x (count(*) AS ?C)
>>   4 WHERE
>>   5   { ?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> :T
>>   6     OPTIONAL
>>   7       { ?x :p ?v}
>>   8   }
>>   9 GROUP BY ?x
>>  10 ORDER BY str(?x)
>> 
>> -----------
>> | x   | C |
>> ===========
>> | :x1 | 2 |
>> | :x2 | 1 |
>> | :x3 | 1 |
>> | :x4 | 1 |
>> -----------
>> 
>> -- Query 2:
>> 
>> Change line 3 to:
>>     SELECT  ?x (count(?v) AS ?C)
>> 
>> -----------
>> | x   | C |
>> ===========
>> | :x1 | 2 |
>> | :x2 | 1 |
>> | :x3 | 1 |
>> | :x4 | 0 |
>> -----------
>> 
>> -- Query 3:
>> 
>> Change line 3 to:
>>     SELECT  ?x (sum(?v) AS ?C)
>> 
>> -----------
>> | x   | C |
>> ===========
>> | :x1 | 3 |
>> | :x2 | 9 |
>> | :x3 | 5 |
>> | :x4 | 0 |
>> -----------
>> 
>> The :x4 row is zero because there were no valid numbers to add together.
>> 
>> -- Different query OPTIONAL part - now has ?p
>> 
>>   1 PREFIX  :     <http://example/>
>>   2
>>   3 SELECT  ?x (sum(?v) AS ?C)
>>   4 WHERE
>>   5   { ?x <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> :T
>>   6     OPTIONAL
>>   7       { ?x ?any ?v}
>>   8   }
>>   9 GROUP BY ?x
>>  10 ORDER BY str(?x)
>> 
>> -----------
>> | x   | C |
>> ===========
>> | :x1 | 3 |
>> | :x2 | 9 |
>> | :x3 | 5 |
>> | :x4 | 0 |
>> -----------
>> 
>> The case  where ?v is "Z2 and "x" have been skipped.
>> 
>>        Andy
>> 
>> 
>> 
>> 
>> 
>> 
>
Received on Friday, 26 March 2010 10:19:10 UTC