Re: Aggregation over unbound variables from Andy Seaborne on 2014-02-14 (public-sparql-dev@w3.org from January to March 2014)

From: Andy Seaborne <andy@apache.org>
Date: Fri, 14 Feb 2014 12:44:27 +0000
To: public-sparql-dev@w3.org
Message-ID: <52FE0FAB.1040103@apache.org>
Arthur,

What does the spec say?

http://www.w3.org/TR/sparql11-query/#defn_aggSum

SUM is defined using the XPath/XQuery operator op:numeric-add applied to
the evaluation of the summed expression over the grouped rows.

If any one of the expressions evaluates to an error, then the whole
aggregate is an error.  Unbound is not null: it's an error to try to get
the value of an unbound variable.

You can sum over expressions where items may be unbound with either of:

SUM(IF(bound(?x),?x,0))

SUM(COALESCE(?x,0))
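
As a rough illustration of the difference (a Python model, not SPARQL, and
not taken from any implementation), the sketch below treats an unbound ?x as
None and an aggregate error as an unbound result. The names GROUPS,
sparql_sum, coalesce and sum_with_coalesce are made up for this example.

```python
from typing import Dict, List, Optional

# Rows per group as an OPTIONAL pattern might produce them.
# None stands in for an unbound variable (assumption of this sketch).
GROUPS: Dict[str, List[Optional[int]]] = {
    "Bob": [5, None],
    "Alice": [4, 3],
}


def sparql_sum(values: List[Optional[int]]) -> Optional[int]:
    """Model of SUM where any evaluation error makes the whole aggregate an error.

    The error is modelled by returning None, i.e. the result stays unbound.
    """
    total = 0
    for value in values:
        if value is None:
            # Reading the value of an unbound variable is an error.
            return None
        total += value
    return total


def coalesce(value: Optional[int], default: int) -> int:
    """Model of COALESCE(?x, default): first argument that is not an error."""
    return default if value is None else value


def sum_with_coalesce(values: List[Optional[int]]) -> Optional[int]:
    """Model of SUM(COALESCE(?x, 0)): unbound rows contribute 0 to the sum."""
    return sparql_sum([coalesce(value, 0) for value in values])


if __name__ == "__main__":
    for name, ages in GROUPS.items():
        print(name, sparql_sum(ages), sum_with_coalesce(ages))
```

In this model the plain sum for "Bob" is left unbound, while the
COALESCE form counts the missing value as 0.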


     Andy

PS
 > We have found that at least two SPARQL implementations use the SQL
 > semantics

If you are naming implementations, could you name all of them?


On 14/02/14 00:29, Arthur Keen wrote:
> We are developing a SPARQL 1.1 implementation, and are hoping for
> some guidance on how the SPARQL 1.1 specification deals with
> aggregation over unbound variables.
>
> I believe the COUNT functions are the same in SQL and SPARQL, but the
> other aggregates (sum/min/max/avg) seem to have different semantics
> with respect to nulls (at least according to Jena).
>
> In SQL, the sum/min/max/avg of a nullable column is the sum/min/max/avg
> of the *non-null* values. For example, suppose you have the following
> data in table "foo":
>
>
>        name    | age
>     ----------+------
>        "Bob"   | 5
>        "Bob"   |
>        "Alice" | 3
>        "Alice" | 4
>
>
> Then the query, "select name, sum(age) from foo group by name" gives this
> result:
>
>
>        name    | sum
>     ----------+-------
>        "Bob"   | 5
>        "Alice" | 7
>
>
>
> In contrast, Jena returns NULL (i.e. unbound) if there are any nulls in
> the data:
>
>
>     -----------------------------------------------------
>     | name    | total | cnt | cntstar | avg | min | max |
>     =====================================================
>     | "Bob"   |       | 1   | 2       |     |     |     |
>     | "Alice" | 7     | 2   | 2       | 3.5 | 3   | 4   |
>     -----------------------------------------------------
>
>
>
>     Data:
>
>
>         @prefix foaf:       <http://xmlns.com/foaf/0.1/> .
>
>         _:a  foaf:name       "Alice" .
>
>         _:a  foaf:age        4 .
>
>         _:b  foaf:name       "Alice" .
>
>         _:b  foaf:age        3 .
>
>         _:c  foaf:name       "Bob" .
>
>         _:c  foaf:age        5 .
>
>         _:d  foaf:name       "Bob" .
>
>
>     Query:
>
>         PREFIX foaf: <http://xmlns.com/foaf/0.1/>
>
>         SELECT ?name
>
>                 (sum(?age) as ?total)
>
>                 (count(?age) as ?cnt)
>
>                 (count(*) as ?cntstar)
>
>                 (avg(?age) as ?avg)
>
>                 (min(?age) as ?min)
>
>                 (max(?age) as ?max)
>
>         WHERE  {
>
>            ?x foaf:name  ?name .
>
>            OPTIONAL { ?x  foaf:age  ?age }
>
>         }
>
>         group by ?name
>
>
> We have found that at least two SPARQL implementations use the SQL
> semantics for this, so it would be of benefit to the SPARQL community to
> have a consistent way to handle aggregation over unbound variables.
>
> Best regards
>
> Arthur Keen
Received on Friday, 14 February 2014 12:44:56 UTC