Feedback on SPARQL 1.1 support for aggregates (was Re: W3C Seeks Feedback on Early Draft of SPARQL 1.1) from Emanuele Della Valle on 2010-02-09 (semantic-web@w3.org from February 2010)

From: Emanuele Della Valle <emanuele.dellavalle@polimi.it>
Date: Tue, 09 Feb 2010 08:27:47 +0100
To: Axel Polleres <axel.polleres@deri.org>
CC: Semantic Web <semantic-web@w3.org>, Stefano Ceri <ceri@elet.polimi.it>, "dbarbieri@elet.polimi.it" <dbarbieri@elet.polimi.it>, Michael Grossniklaus <grossniklaus@elet.polimi.it>, Daniele Maria Braga <braga@elet.polimi.it>, Frank van Harmelen <Frank.van.Harmelen@cs.vu.nl>
Message-ID: <4B710E73.7010906@polimi.it>
Dear Axel & Lee,

we [1] would like to comment on the support for aggregates in SPARQL 
1.1. In the last two years we have been working on an extension to 
SPARQL for continuous querying over streams of RDF (namely C-SPARQL 
[2]). Central to stream processing is support for aggregates. For this 
reason we have defined and implemented  [3] our own support for 
aggregates in SPARQL which is orthogonal to the other stream processing  
features of C-SPARQL. We believe such extension can be of general 
interest for SPARQL 1.1 WG.

In the rest of the mail we first introduce support for aggregates in 
C-SPARQL, then we compare C-SPARQL and SPARQL 1.1 support for 
aggregates. We show that:
1. C-SPARQL syntax for aggregates appears more compact and handy than 
the SPARQL 1.1 one,
2. all SPARQL 1.1 queries with aggregates can be expressed in C-SPARQL, and
3. there are queries that can be expressed in C-SPARQL but it is unclear 
whether they can be expressed in SPARQL 1.1.

This text is also available (in a much more readable form) at 
http://wiki.larkc.eu/c-sparql/sparql11-feedback

== Support For Aggregates In C-SPARQL ==

Aggregation clauses in C-SPARQL are added at the end of the query, and 
have the following syntax:

 AggregateClause --> ( "AGGREGATE {(" var "," Function "," Group ")" 
[Filter] "}" )*
 Function --> "COUNT" | "SUM" | "AVG" | "MIN" | "MAX"
 Group --> var | "{" var ( "," var )* "}"

Every aggregation clause has the following three parts:

* The first part is a new variable (i.e., a variable not in the WHERE 
clause or in other aggregation clauses).
* The second part is an aggregation function (one of: COUNT, MAX, MIN, 
SUM, AVG); COUNT may have no argument, while the other functions take 
one of the variables occurring in the WHERE clause as argument.
* The third part is a set of one or more variables, which are chosen 
among those occurring in the WHERE clause.These variables express the 
grouping criteria.

Every clause may also have an optional fourth part, a FILTER clause.

== Example Of Simple Support For Aggregates In C-SPARQL And SPARQL 1.1 ==

Data:

@prefix : <http://books.example/> .
:auth1  :name "Alice Foo", :writesBook :book1; :book2 .
:auth2  :name "Bob Bar", :writesBook :book2 .

The following query counts the number of books written by an author and 
returns the name and the number of books.

SELECT ?name ?book ?numberOfBooks
WHERE {
    ?auth :name ?name .
    ?auth :writesBook ?book .
}
AGGREGATE { (?numberOfBooks, COUNT, {?auth} ) }

The semantics of a query containing aggregates consists in adding new 
variable bindings computed by the WHERE clause to the existing regular 
variable bindings. For each of the new variables introduced by the 
AGGREGATE clauses, one new variable binding is added. The query result 
constructed in this way may be further filtered by a standard FILTER 
clause, which may refer to all the variables introduced in the WHERE and 
AGGREGATE clauses.

Our C-SPARQL extension is based on the conviction that in the context of 
RDF, knowledge should be extended rather than shrunk. Therefore, we 
propose to generate additional variable bindings and use them to 
annotate any existing variable binding that contributed to the aggregate 
value.

Results:
?name       | ?numberOfBooks
------------------------------
"Alice Foo" |            "2"
"Bob Bar"   |            "1"

This is in contrast to the conventional SQL grouping semantics that 
replaces all aggregated tuples with a single tuple representing the 
aggregate value. In this respect, we believe that our approach to 
aggregation is more aligned with the baseline of the SPARQL semantics.

Judging fro the example in Section 10 of the SPARQL 1.1 draft, the query 
above can be expressed in SPARQL 1.1 in the following way:

SELECT ?name ?numberOfBooks
WHERE {
     ?auth :name ?name .    
     {
             SELECT ?auth (COUNT(?book) AS ?numberOfBooks)
             WHERE {
                     ?auth :writesBook ?book .
             }
             GROUP BY ?auth
     }
}

In the C-SPARQL language all the variables used in AGGREGATE clauses 
must appear also in the SELECT clause, since aggregation happens after 
standard SPARQL query evaluation. In SPARQL 1.1 the constraint is not 
specified.

Wrapping up, queries of this kind can be expressed both in SPARQL 1.1 
and C-SPARQL, but C-SPARQL syntax appears more compact and handy.

== Aggregates Supported In SPARQL 1.1 Are Also Supported In C-SPARQL ==

Given the current SPARQL 1.1 support for aggregates, it appears that all 
SPARQL 1.1 queries with aggregates can be expressed in C-SPARQL.

For instance, the following query is the example of SPARQL 1.1 support 
for aggregates appearing in Section 9 of the current draft.

PREFIX  <http://books.example/>
SELECT (SUM(?lprice) AS ?totalPrice)
WHERE {
  ?org :affiliates ?auth .
  ?auth :writesBook ?book .
  ?book :price ?lprice .
}
GROUP BY ?org
HAVING (SUM(?lprice) > 10)

Such a query in C-SPARQL will be written as follows.

PREFIX  <http://books.example/>
SELECT ?totalPrice
WHERE {
  ?org :affiliates ?auth .
  ?auth :writesBook ?book .
  ?book :price ?lprice .
}
AGGREGATE { (?totalPrice, SUM(?lprice), {?org}) FILTER ( ?totalPrice > 10) }

== Queries That Can Be Expressed In C-SPARQL But It Is Unclear Whether 
They Can Be Expressed In SPARQL 1.1 ==

Given the current SPARQL 1.1 support for aggregates, it is unclear 
whether the following C-SPARQL queries can be expressed in SPARQL 1.1

Query: the average number of books written by authors that wrote at 
least 5 books.

SELECT ?name ?book ?numberOfBooks ?averageNumberOfBooks
WHERE {
    ?auth :name ?name .
    ?auth :wrote ?book .
}
AGGREGATE { (?numberOfBooks, COUNT, {?auth} ) FILTER (?numberOfBooks > 5) }
AGGREGATE { (?averageNumberOfBooks, AVG, {?numberOfBooks} ) }

A possible way to express it in SPARQL 1.1 is illustrated hereafter, but 
no examples in the current draft show that this is possible.

SELECT ?name ?surname ?book ?numberOfBooks (AVG(?numberOfBooks) AS 
?averageNumberOfBooks)
WHERE {
     ?auth :hasSurname ?surname .
     ?auth :hasName ?name .
     {
             SELECT ?auth (COUNT(?book) AS ?numberOfBooks)
             WHERE {
                     ?auth :wrote ?book .
    }
             GROUP BY ?auth
            HAVING (?numberOfBooks > 5)
}

More complex sequences of aggregation are supported in C-SPARQL, such as
* computing the number of books per author
* keeping only the authors who have published at least 5 books
* computing the total number of books grouping by affiliation
* filtering the affiliation with less than 50 books published

SELECT ?name ?surname ?book ?numberOfBooks ?averageNumberOfBooks
WHERE {
    ?auth :name ?name .
    ?auth :surname ?surname .
    ?auth :wrote ?book .
    ?auth :affiliated ?organization .
}
AGGREGATE { (?numberOfBooks, COUNT, {?auth} ) FILTER (?numberOfBooks > 5) }
AGGREGATE { (?affiliationBooks, SUM(?numberOfBooks), {?organization} ) 
FILTER (?affiliationBooks > 50)}

A possible way to express it in SPARQL 1.1 is illustrated hereafter, but 
as above no examples in the current draft show that this is possible.

SELECT ?name ?surname ?book ?numberOfBooks
WHERE {
   ?auth :hasSurname ?surname .
   ?auth :hasName ?name .
   {
    SELECT ?affiliation (SUM(?numberOfBooks) as ?affiliationBooks)
    WHERE {
     ?auth :affiliated ?organization .
     {
      SELECT ?auth (COUNT(?book) AS ?numberOfBooks)
      WHERE {
       ?auth :wrote ?book .
      }
      GROUP BY ?auth
     HAVING (?numberOfBooks > 5)
    }
    GROUP BY ?organization
   HAVING (?affiliationBooks > 50)
   }

In C-SPARQL, evaluation of multiple aggregation with filtering clauses 
is possible. For instance, one can ask for the research topics for which 
the Italian authors are more than the Swiss ones.

SELECT ?topic ?numberOfSwissAuthors ?numberOfItalianAuthors
WHERE {
    ?auth :name ?name .
    ?auth :wrote ?book .
    ?book :topic ?topic .
    ?auth :hasNationality ?nat .
}
AGGREGATE { FILTER(?nat = 'IT') (?numberOfItalianAuthors, COUNT, 
{?topic} ) }
AGGREGATE { FILTER(?nat = 'CH') (?numberOfSwissAuthors, COUNT, {?topic} 
) FILTER(?numberOfItalianAuthors>?numberOfSwissAuthors)}

Given the current draft of SPARQL 1.1, two alternative formulations 
appear possible, one using a FILTER and one using an HAVING. However, 
both require the SPARQL 1.1 engine to decide the order of execution, 
whereas in C-SPARQL the order is given explicitly.

SPARQL 1.1 version that uses the FILTER clause (see line 24):

 1. SELECT ?topic ?numberOfSwissAuthors ?numberOfItalianAuthors
 2. WHERE {
 3.     ?auth :hasName ?name .
 4.     {
 5.             SELECT ?auth (COUNT(?book) AS ?numberOfSwissAuthors)
 6.             WHERE {
 7.                     ?auth :wrote ?book .
 8.                     ?book :topic ?topic .
 9.                     ?auth :hasNationality ?nat .
10.                     FILTER(?nat = 'CH') .
11.             }
12.             GROUP BY ?topic
13.     }
14.     {
15.             SELECT ?auth (COUNT(?book) AS ?numberOfItalianAuthors)
16.             WHERE {
17.                     ?auth :wrote ?book .
18.                     ?book :topic ?topic .
19.                     ?auth :hasNationality ?nat .
20.                     FILTER(?nat = 'IT') .
21.             }
22.             GROUP BY ?topic    
23.     }
24.     FILTER(?numberOfItalianAuthors>?numberOfSwissAuthors)    
25. }

SPARQL 1.1 version that uses the HAVING clause (see line 23):

 1. SELECT ?topic ?numberOfSwissAuthors ?numberOfItalianAuthors
 2. WHERE {
 3.     ?auth :hasName ?name .
 4.     {
 5.             SELECT ?auth (COUNT(?book) AS ?numberOfSwissAuthors)
 6.             WHERE {
 7.                     ?auth :wrote ?book .
 8.                     ?book :topic ?topic .
 9.                     ?auth :hasNationality ?nat .
10.                     FILTER(?nat = 'CH') .
11.             }
12.             GROUP BY ?topic
13.     }
14.     {
15.             SELECT ?auth (COUNT(?book) AS ?numberOfItalianAuthors)
16.             WHERE {
17.                     ?auth :wrote ?book .
18.                     ?book :topic ?topic .
19.                     ?auth :hasNationality ?nat .
20.                     FILTER(?nat = 'IT') .
21.             }
22.             GROUP BY ?topic  
23.             HAVING(?numberOfItalianAuthors>?numberOfSwissAuthors)
24.     }   
25. }

Therefore, we believe that there are queries that can be expressed in 
C-SPARQL but not in SPARQL 1.1.

== Computing Multiple Independent Aggregates At The Same Time ==

As we explained in the introduction, C-SPARQL was explicitly designed 
for processing RDF streams. The transient nature of streams poses the 
requirement to compute multiple (possibly independent) aggregates at the 
same time in the same query, because assuring that two independent 
queries process exactly the same data is very difficult. Therefore, 
multiple independent aggregations are also allowed within the same 
C-SPARQL query, with different grouping criteria and different 
partitions over the same set of bindings, thus pushing the aggregation 
capabilities beyond those of SQL.

The following query counts the number of books written by an author, 
counts the number of authors per book and returns the name, the book, 
the number of books and the number of authors.

SELECT ?name ?book ?numberOfBooks ?numberOfAuthors
WHERE {
    ?auth :name ?name .
    ?auth :wrote ?book .
}
AGGREGATE { (?numberOfBooks, COUNT, {?auth} ) }
AGGREGATE { (?numberOfAuthors, COUNT, {?book} ) }

Results:

?name       | ?book | ?numberOfBooks | ?numberOfAuthors
---------------------------------------------------------
"Alice Foo" |    b1 |            "2" |              "1"
"Alice Foo" |    b2 |            "2" |              "2"
"Bob   Bar" |    b2 |            "1" |              "2"

Judging from SPARQL 1.1 draft, the query above can be expressed in 
SPARQL 1.1 in the following way:

SELECT ?name ?surname ?book ?numberOfBooks ?numberOfAuthors
WHERE {
     ?auth :hasSurname ?surname .
     ?auth :hasName ?name .
     {
             SELECT ?auth (COUNT(?book) AS ?numberOfBooks)
             WHERE {
                     ?auth :wrote ?book .
             }
             GROUP BY ?auth
     }
     {
             SELECT  ?auth (COUNT(?auth) AS ?numberOfAuthors)
             WHERE {
                    ?auth :wrote ?book .
             }
             GROUP BY ?book
     }
}

== Conclusion ==

Clearly, the C-SPARQL notation is less cumbersome and more concise than 
the SPARQL 1.1 one. It is therefore, easier to express and understand 
aggregates in C-SPARQL than in the current draft of SPARQL 1.1, which we 
believe is an important factor in language design and adoption.

== References ==

[1] http://dbgroup.elet.polimi.it/
[2] http://wiki.larkc.eu/c-sparql
[3] 
http://www.larkc.eu/wp-content/uploads/2008/01/larkc_d33-description-of-strategy-and-design-for-data-stream-management-approaches_final.pdf

Axel Polleres ha scritto:
> The SPARQL WG has published  a First Public Working Draft "SPARQL 1.1 Property Paths" (http://www.w3.org/TR/2010/WD-sparql11-property-paths-20100126/) which defines a more succinct way to write parts of basic graph patterns and also extend matching of triple pattern to arbitrary length paths. The group also published six updates listed below:
>
> * SPARQL 1.1 Query (new working draft: http://www.w3.org/TR/2010/WD-sparql11-query-20010126/) - Adds support for aggregates, subqueries, projected expressions, and negation to the SPARQL query language.
> * SPARQL 1.1 Update (new working draft: http://www.w3.org/TR/2010/WD-sparql11-update-20010126/) - - Defines an update language for RDF graphs.
> * SPARQL 1.1 Protocol for RDF (new working draft: http://www.w3.org/TR/2010/WD-sparql11-protocol-20010126/) - Defines an abstract interface and HTTP bindings for a protocol to issue SPARQL Query and SPARQL Update statements against a SPARQL endpoint.
> * SPARQL 1.1 Service Description (new working draft: http://www.w3.org/TR/2010/WD-sparql11-service-description-20010126/) - Defines a vocabulary and discovery mechanism for describing the capabilities of a SPARQL endpoint.
> * SPARQL 1.1 Uniform HTTP Protocol for Managing RDF Graphs (new working draft: http://www.w3.org/TR/2010/WD-sparql11-http-rdf-update-20010126/) - Describes the use of the HTTP protocol for managing named RDF graphs on an HTTP server.
> * SPARQL 1.1 Entailment Regimes (new working draft: http://www.w3.org/TR/2010/WD-sparql11-entailment-20010126/) - Defines conditions under which SPARQL queries can be used with entailment regimes such as RDF, RDF Schema, OWL, or RIF.
>
> The working group is seeking feedback on these drafts from, especially on those points currently marked with ISSUEs currently under discussion, cf. [1]. If you wish to make comments regarding these documents, please send them to 
>
>     public-rdf-dawg-comments@w3.org
>
> with best regards,
> Axel Polleres & Lee Feigenbaum (on behalf of the SPARQL WG)
>
> 1. http://www.w3.org/2009/sparql/track/issues/open
>
Received on Tuesday, 9 February 2010 07:28:43 UTC