Re: Proposed definition of ExprMultiSet from Andy Seaborne on 2010-03-29 (public-rdf-dawg@w3.org from January to March 2010)

From: Andy Seaborne <andy.seaborne@talis.com>
Date: Mon, 29 Mar 2010 23:07:11 +0100
To: Axel Polleres <axel.polleres@deri.org>
CC: SPARQL Working Group <public-rdf-dawg@w3.org>
Message-ID: <4BB1248F.1000404@talis.com>
On 26/03/2010 10:18 AM, Axel Polleres wrote:
>
> On 26 Mar 2010, at 09:57, Axel Polleres wrote:
>
>> short clarification request:
>>
>>> { eval(expr,μ) | μ in Ω such that eval(μ(expr)) is defined }
>>
>> by "is defined" you mean "is unequal to 'error'", yes?
>
> p.s.:
>
>   or do you mean such that μ(expr) is defined ?

μ is a substitution function (it's mapping from vars to terms).

μ(expr) is the rewrite of the expression with variables replaced by any 
defined values.

eval(μ(expr)) is the value of that (and errors for any unbound variables 
- they didn't get substitued and variables aren't RDF terms).

 Andy



>
> I think I get the intention... to be able to treat unbound different from error, yes?
>
>
> here my understanding of the proposal. Say we have:
>
>   ?X ?Y

Does not make sense later on:

Lets' try :

 >   ?Z ?X
>   -----
>   a  1
>   b  0
>   c
>   d  "bla"
>   e  1
>
> Here my understanding of the proposal:
>
>   COUNT( * ) ->  5

Agreed

>   COUNT( ?X ) ->  4

Agreed after ?X moved to right-hand column.


>   COUNT( DISTINCT ?X ) ->  3

Agreed.

>
> yes? if so, clear so far. now what about expressions?
>
>   COUNT( ?X * ?X || ?X * ?X )  ->   ?

""" SPARQL 1.0:
Note that logical-or operates on the effective boolean value of its 
arguments.
"""

So 3 as ?X*?X is only defined for 1,0,1

1||1 => true
0||0 => false

>   COUNT( DISTINCT (?X * ?X || ?X * ?X) )  ->   ?

2 (true and false of ||)

>
> concretely, what happens to the "bla" row that produces an error? what happens to the unbound row, that also producse an error when the expression is evaluated?

Same - eval(?x) is undefined as is eval("bla"*"bla")

>
> Thanks,
> Axel

 Andy

>
>>
>> What I mean to ask here... when I read the current section on Filter evaluation
>> http://www.w3.org/2009/sparql/docs/query-1.1/rq25.xml#evaluation
>> to me it seems that eval is always "defined", but it could be an error for reasons of mistyping or
>> values being unbound.
>>
>> Thanks for clarification on whether/what I might have overlooked here!
>>
>> Axel
>>
>> On 7 Mar 2010, at 17:42, Andy Seaborne wrote:
>>
>>> ISSUE-53
>>>
>>> I propose the following to define ExprMultiSet:
>>>
>>> -------
>>>
>>> Let Ω be a partition.
>>>
>>> ExprMultiSet(Ω) =
>>>    { eval(expr,μ) | μ in Ω such that eval(μ(expr)) is defined }
>>>    UNION
>>>    { e | μ in Ω such that  eval(μ(expr)) is undefined }
>>>
>>> where "e" is some symbol that is distinct from all RDF terms.
>>>
>>> card[x]:
>>>    if DISTINCT:
>>>       card[x] = 1 if there exists μ in Ω such that x =  eval(μ(expr))
>>>       card[x] = 0 otherwise
>>>    else
>>>       card[x] = count of μ in Ω such that x =  eval(μ(expr))
>>>
>>> --------
>>>
>>> "e" just records error evaluations.
>>>
>>> This is the most flexible definition. An alternative is
>>>
>>> ExprMultiset(Ω) =
>>>    { eval(expr,μ) | μ in Ω such that eval(expr,μ) is defined }
>>>
>>> which is hard-coding dropping errors and unbounds during evaluation. But
>>> the aggregate can't know there were some errors.
>>>
>>> Another possibility is that a yes/no flag indicating a error was seen.
>>> But this might as well be the count of errors, which is equivalent to
>>> the flexible definition given.
>>>
>>> By the way, this is in no way a recipe for implementation.  Aggregation
>>> can be done over all groups in parallel during query execution.
>>>
>>>
>>>
>>> For the last publication, it was noted
>>>
>>> http://lists.w3.org/Archives/Public/public-rdf-dawg/2009OctDec/0646.html
>>>
>>> Unbound and error are the same. The current design so far has it that
>>> any error means that the multiset is invalid and that group is not
>>> considered.
>>>
>>> We didn't have time to propose a solid design to address ISSUE-53 - the
>>> potential design at the time of publication was that any error when
>>> calculating the ExprMultiset from a partition meant that
>>>
>>> SUM of {1, 2, unbound} is an error.
>>> COUNT of {1, 2, unbound} is an error.
>>>
>>> I don't think that is a useful form for COUNT(?x).  It does seem to mean
>>> that COUNT(?x) is either COUNT(*) or error; it can't be anything else.
>>>
>>> COUNT(?x) can not be zero because zero arises when there are no ?x but
>>> there are solutions in the partition.  If there are no solutions in the
>>> partition then there is no group key and no grouping happens.
>>>
>>> For each aggregate we can decide what happens about unbounds and errors.
>>>
>>> I would like to see:
>>>
>>> COUNT(*) = size of multiset.
>>> COUNT(DISTINCT *) = size of set after removing any e (i.e. skip undefs).
>>>
>>> COUNT(?x) = number of times ?x is defined in each group
>>>      0<= COUNT(?x)<= COUNT(*)
>>>
>>> COUNT(DISTINCT ?x) = number of times ?x is uniquely defined in each group
>>>
>>> I'm less worried about SUM(?x) but I'd prefer that
>>>
>>>    SUM(?x) = op:numeric-add of defined values of ?x, skips unbounds
>>>
>>> rather that the rigid form we currently have.
>>>
>>> Previously, one of the difficulties raised for this design was that the
>>> operation to add two numbers wasn't op:numeric-add because that could
>>> not cope the errors (there were related datatyping issues as well).
>>>
>>> With the definition of ExprMultiSet above, op:numeric-add can be used to
>>> define SUM.  There is step between getting the ExprMultiSet and the
>>> calculation of aggregation.  This step, for SUM (and COUNT(?x)), removes
>>> any errors.
>>>
>>> GROUP_CONCAT(?x) = concatenation
>>> and now GROUP_CONCAT of an empty set can be defined as "".
>>>
>>> -------------
>>> Some examples:
>>>
>>> Does anyone want to suggest we design to get different results in any of
>>> these cases?
>>>
>>>
>>> --Data:
>>>
>>> @prefix :<http://example/>  .
>>>
>>> :x1 a :T .
>>> :x1 :p 1 .
>>> :x1 :p 2 .
>>>
>>> :x2 a :T .
>>> :x2 :p 9 .
>>>
>>> :x3 a :T .
>>> :x3 :p 5 .
>>> :x3 :q "x" .
>>>
>>> :x4 a :T .
>>> :x4 :q "z".
>>>
>>>
>>> --
>>>
>>>
>>> -- Query 1:
>>>    1 PREFIX  :<http://example/>
>>>    2
>>>    3 SELECT  ?x (count(*) AS ?C)
>>>    4 WHERE
>>>    5   { ?x<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>  :T
>>>    6     OPTIONAL
>>>    7       { ?x :p ?v}
>>>    8   }
>>>    9 GROUP BY ?x
>>>   10 ORDER BY str(?x)
>>>
>>> -----------
>>> | x   | C |
>>> ===========
>>> | :x1 | 2 |
>>> | :x2 | 1 |
>>> | :x3 | 1 |
>>> | :x4 | 1 |
>>> -----------
>>>
>>> -- Query 2:
>>>
>>> Change line 3 to:
>>>      SELECT  ?x (count(?v) AS ?C)
>>>
>>> -----------
>>> | x   | C |
>>> ===========
>>> | :x1 | 2 |
>>> | :x2 | 1 |
>>> | :x3 | 1 |
>>> | :x4 | 0 |
>>> -----------
>>>
>>> -- Query 3:
>>>
>>> Change line 3 to:
>>>      SELECT  ?x (sum(?v) AS ?C)
>>>
>>> -----------
>>> | x   | C |
>>> ===========
>>> | :x1 | 3 |
>>> | :x2 | 9 |
>>> | :x3 | 5 |
>>> | :x4 | 0 |
>>> -----------
>>>
>>> The :x4 row is zero because there were no valid numbers to add together.
>>>
>>> -- Different query OPTIONAL part - now has ?p
>>>
>>>    1 PREFIX  :<http://example/>
>>>    2
>>>    3 SELECT  ?x (sum(?v) AS ?C)
>>>    4 WHERE
>>>    5   { ?x<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>  :T
>>>    6     OPTIONAL
>>>    7       { ?x ?any ?v}
>>>    8   }
>>>    9 GROUP BY ?x
>>>   10 ORDER BY str(?x)
>>>
>>> -----------
>>> | x   | C |
>>> ===========
>>> | :x1 | 3 |
>>> | :x2 | 9 |
>>> | :x3 | 5 |
>>> | :x4 | 0 |
>>> -----------
>>>
>>> The case  where ?v is "Z2 and "x" have been skipped.
>>>
>>>         Andy
>>>
>>>
>>>
>>>
>>>
>>>
>>
>
Received on Monday, 29 March 2010 22:07:51 UTC