Re: Feedback on SPARQL 1.1 support for aggregates (was Re: W3C Seeks Feedback on Early Draft of SPARQL 1.1) from Axel Polleres on 2010-02-16 (public-rdf-dawg-comments@w3.org from February 2010)

From: Axel Polleres <axel.polleres@deri.org>
Date: Tue, 16 Feb 2010 16:25:30 +0000
To: Emanuele Della Valle <EMANUELE.DELLAVALLE@POLIMI.IT>, public-rdf-dawg-comments@w3.org
Message-Id: <E88418BD-C658-4035-AB60-8B3A100B7088@deri.org>
Hi Emanuele,

before giving official response to you from the WG, 
<chairhat-off> 
let me ask some clarifying questions <chairhat-off> 
for my better understanding:


1)

If I get it right, AGGREGATE {var  FUNCTION vars ) 

i) projects first groups wrt variables appearing in vars and then  
ii)  evaluates the aggregate on the those groups ...

That may make sense for count, but how does that work for 
min/max, i.e. where is the projection ? 

   hmmm... actually it seems the grammar given for FUNCTION is wrong... 
   it should be 
    
    Function | Function '(' var ')'
  or 
    Function | Function '(' vars ')'

2) On the following examples...


SELECT ?name ?surname ?book ?numberOfBooks ?averageNumberOfBooks
WHERE {
    ?auth :name ?name .
    ?auth :surname ?surname .
    ?auth :wrote ?book .
    ?auth :affiliated ?organization .
}
AGGREGATE { (?numberOfBooks, COUNT, {?auth} ) FILTER (?numberOfBooks > 5) }
AGGREGATE { (?affiliationBooks, SUM(?numberOfBooks), {?organization} )
FILTER (?affiliationBooks > 50)}

here, it seems that ?affiliationBooks violates the constraint you pose before:

"In the C-SPARQL language all the variables used in AGGREGATE clauses
must appear also in the SELECT clause, since aggregation happens after
standard SPARQL query evaluation. In SPARQL 1.1 the constraint is not
specified."

I assume you mean the constraint only to apply to  the variables mentioned in the function and
group part?

Also, I'd like to see the result table for this.

3) also, it seems to me that your SPARQL1.1 formulation of this one

SELECT ?topic ?numberOfSwissAuthors ?numberOfItalianAuthors
WHERE {
    ?auth :name ?name .
    ?auth :wrote ?book .
    ?book :topic ?topic .
    ?auth :hasNationality ?nat .
}
AGGREGATE { FILTER(?nat = 'IT') (?numberOfItalianAuthors, COUNT,
{?topic} ) }
AGGREGATE { FILTER(?nat = 'CH') (?numberOfSwissAuthors, COUNT, {?topic}
) FILTER(?numberOfItalianAuthors>?numberOfSwissAuthors)}

has some errors... e.g. why is the triple outside the subselects 

 ?auth :hasName ?name .

and you proejct auth? 
Wouldn't you want to project the ?topic instead?


Axel
</chairhat-off> 

On 9 Feb 2010, at 09:27, Emanuele Della Valle wrote:

> Dear Axel & Lee,
> 
> we [1] would like to comment on the support for aggregates in SPARQL
> 1.1. In the last two years we have been working on an extension to
> SPARQL for continuous querying over streams of RDF (namely C-SPARQL
> [2]). Central to stream processing is support for aggregates. For this
> reason we have defined and implemented  [3] our own support for
> aggregates in SPARQL which is orthogonal to the other stream processing 
> features of C-SPARQL. We believe such extension can be of general
> interest for SPARQL 1.1 WG.
> 
> In the rest of the mail we first introduce support for aggregates in
> C-SPARQL, then we compare C-SPARQL and SPARQL 1.1 support for
> aggregates. We show that:
> 1. C-SPARQL syntax for aggregates appears more compact and handy than
> the SPARQL 1.1 one,
> 2. all SPARQL 1.1 queries with aggregates can be expressed in C-SPARQL, and
> 3. there are queries that can be expressed in C-SPARQL but it is unclear
> whether they can be expressed in SPARQL 1.1.
> 
> This text is also available (in a much more readable form) at
> http://wiki.larkc.eu/c-sparql/sparql11-feedback
> 
> == Support For Aggregates In C-SPARQL ==
> 
> Aggregation clauses in C-SPARQL are added at the end of the query, and
> have the following syntax:
> 
> AggregateClause --> ( "AGGREGATE {(" var "," Function "," Group ")"
> [Filter] "}" )*
> Function --> "COUNT" | "SUM" | "AVG" | "MIN" | "MAX"
> Group --> var | "{" var ( "," var )* "}"
> 
> Every aggregation clause has the following three parts:
> 
> * The first part is a new variable (i.e., a variable not in the WHERE
> clause or in other aggregation clauses).
> * The second part is an aggregation function (one of: COUNT, MAX, MIN,
> SUM, AVG); COUNT may have no argument, while the other functions take
> one of the variables occurring in the WHERE clause as argument.
> * The third part is a set of one or more variables, which are chosen
> among those occurring in the WHERE clause.These variables express the
> grouping criteria.
> 
> Every clause may also have an optional fourth part, a FILTER clause.
> 
> == Example Of Simple Support For Aggregates In C-SPARQL And SPARQL 1.1 ==
> 
> Data:
> 
> @prefix : <http://books.example/> .
> :auth1  :name "Alice Foo", :writesBook :book1; :book2 .
> :auth2  :name "Bob Bar", :writesBook :book2 .
> 
> The following query counts the number of books written by an author and
> returns the name and the number of books.
> 
> SELECT ?name ?book ?numberOfBooks
> WHERE {
>    ?auth :name ?name .
>    ?auth :writesBook ?book .
> }
> AGGREGATE { (?numberOfBooks, COUNT, {?auth} ) }
> 
> The semantics of a query containing aggregates consists in adding new
> variable bindings computed by the WHERE clause to the existing regular
> variable bindings. For each of the new variables introduced by the
> AGGREGATE clauses, one new variable binding is added. The query result
> constructed in this way may be further filtered by a standard FILTER
> clause, which may refer to all the variables introduced in the WHERE and
> AGGREGATE clauses.
> 
> Our C-SPARQL extension is based on the conviction that in the context of
> RDF, knowledge should be extended rather than shrunk. Therefore, we
> propose to generate additional variable bindings and use them to
> annotate any existing variable binding that contributed to the aggregate
> value.
> 
> Results:
> ?name       | ?numberOfBooks
> ------------------------------
> "Alice Foo" |            "2"
> "Bob Bar"   |            "1"
> 
> This is in contrast to the conventional SQL grouping semantics that
> replaces all aggregated tuples with a single tuple representing the
> aggregate value. In this respect, we believe that our approach to
> aggregation is more aligned with the baseline of the SPARQL semantics.
> 
> Judging fro the example in Section 10 of the SPARQL 1.1 draft, the query
> above can be expressed in SPARQL 1.1 in the following way:
> 
> SELECT ?name ?numberOfBooks
> WHERE {
>     ?auth :name ?name .        {
>             SELECT ?auth (COUNT(?book) AS ?numberOfBooks)
>             WHERE {
>                     ?auth :writesBook ?book .
>             }
>             GROUP BY ?auth
>     }
> }
> 
> In the C-SPARQL language all the variables used in AGGREGATE clauses
> must appear also in the SELECT clause, since aggregation happens after
> standard SPARQL query evaluation. In SPARQL 1.1 the constraint is not
> specified.
> 
> Wrapping up, queries of this kind can be expressed both in SPARQL 1.1
> and C-SPARQL, but C-SPARQL syntax appears more compact and handy.
> 
> == Aggregates Supported In SPARQL 1.1 Are Also Supported In C-SPARQL ==
> 
> Given the current SPARQL 1.1 support for aggregates, it appears that all
> SPARQL 1.1 queries with aggregates can be expressed in C-SPARQL.
> 
> For instance, the following query is the example of SPARQL 1.1 support
> for aggregates appearing in Section 9 of the current draft.
> 
> PREFIX  <http://books.example/>
> SELECT (SUM(?lprice) AS ?totalPrice)
> WHERE {
>  ?org :affiliates ?auth .
>  ?auth :writesBook ?book .
>  ?book :price ?lprice .
> }
> GROUP BY ?org
> HAVING (SUM(?lprice) > 10)
> 
> Such a query in C-SPARQL will be written as follows.
> 
> PREFIX  <http://books.example/>
> SELECT ?totalPrice
> WHERE {
>  ?org :affiliates ?auth .
>  ?auth :writesBook ?book .
>  ?book :price ?lprice .
> }
> AGGREGATE { (?totalPrice, SUM(?lprice), {?org}) FILTER ( ?totalPrice >
> 10) }
> 
> == Queries That Can Be Expressed In C-SPARQL But It Is Unclear Whether
> They Can Be Expressed In SPARQL 1.1 ==
> 
> Given the current SPARQL 1.1 support for aggregates, it is unclear
> whether the following C-SPARQL queries can be expressed in SPARQL 1.1
> 
> Query: the average number of books written by authors that wrote at
> least 5 books.
> 
> SELECT ?name ?book ?numberOfBooks ?averageNumberOfBooks
> WHERE {
>    ?auth :name ?name .
>    ?auth :wrote ?book .
> }
> AGGREGATE { (?numberOfBooks, COUNT, {?auth} ) FILTER (?numberOfBooks > 5) }
> AGGREGATE { (?averageNumberOfBooks, AVG, {?numberOfBooks} ) }
> 
> A possible way to express it in SPARQL 1.1 is illustrated hereafter, but
> no examples in the current draft show that this is possible.
> 
> SELECT ?name ?surname ?book ?numberOfBooks (AVG(?numberOfBooks) AS
> ?averageNumberOfBooks)
> WHERE {
>     ?auth :hasSurname ?surname .
>     ?auth :hasName ?name .
>     {
>             SELECT ?auth (COUNT(?book) AS ?numberOfBooks)
>             WHERE {
>                     ?auth :wrote ?book .
>    }
>             GROUP BY ?auth
>            HAVING (?numberOfBooks > 5)
> }
> 
> More complex sequences of aggregation are supported in C-SPARQL, such as
> * computing the number of books per author
> * keeping only the authors who have published at least 5 books
> * computing the total number of books grouping by affiliation
> * filtering the affiliation with less than 50 books published
> 
> SELECT ?name ?surname ?book ?numberOfBooks ?averageNumberOfBooks
> WHERE {
>    ?auth :name ?name .
>    ?auth :surname ?surname .
>    ?auth :wrote ?book .
>    ?auth :affiliated ?organization .
> }
> AGGREGATE { (?numberOfBooks, COUNT, {?auth} ) FILTER (?numberOfBooks > 5) }
> AGGREGATE { (?affiliationBooks, SUM(?numberOfBooks), {?organization} )
> FILTER (?affiliationBooks > 50)}
> 
> A possible way to express it in SPARQL 1.1 is illustrated hereafter, but
> as above no examples in the current draft show that this is possible.
> 
> SELECT ?name ?surname ?book ?numberOfBooks
> WHERE {
>   ?auth :hasSurname ?surname .
>   ?auth :hasName ?name .
>   {
>    SELECT ?affiliation (SUM(?numberOfBooks) as ?affiliationBooks)
>    WHERE {
>     ?auth :affiliated ?organization .
>     {
>      SELECT ?auth (COUNT(?book) AS ?numberOfBooks)
>      WHERE {
>       ?auth :wrote ?book .
>      }
>      GROUP BY ?auth
>     HAVING (?numberOfBooks > 5)
>    }
>    GROUP BY ?organization
>   HAVING (?affiliationBooks > 50)
>   }
> 
> In C-SPARQL, evaluation of multiple aggregation with filtering clauses
> is possible. For instance, one can ask for the research topics for which
> the Italian authors are more than the Swiss ones.
> 
> SELECT ?topic ?numberOfSwissAuthors ?numberOfItalianAuthors
> WHERE {
>    ?auth :name ?name .
>    ?auth :wrote ?book .
>    ?book :topic ?topic .
>    ?auth :hasNationality ?nat .
> }
> AGGREGATE { FILTER(?nat = 'IT') (?numberOfItalianAuthors, COUNT,
> {?topic} ) }
> AGGREGATE { FILTER(?nat = 'CH') (?numberOfSwissAuthors, COUNT, {?topic}
> ) FILTER(?numberOfItalianAuthors>?numberOfSwissAuthors)}
> 
> Given the current draft of SPARQL 1.1, two alternative formulations
> appear possible, one using a FILTER and one using an HAVING. However,
> both require the SPARQL 1.1 engine to decide the order of execution,
> whereas in C-SPARQL the order is given explicitly.
> 
> SPARQL 1.1 version that uses the FILTER clause (see line 24):
> 
> 1. SELECT ?topic ?numberOfSwissAuthors ?numberOfItalianAuthors
> 2. WHERE {
> 3.     ?auth :hasName ?name .
> 4.     {
> 5.             SELECT ?auth (COUNT(?book) AS ?numberOfSwissAuthors)
> 6.             WHERE {
> 7.                     ?auth :wrote ?book .
> 8.                     ?book :topic ?topic .
> 9.                     ?auth :hasNationality ?nat .
> 10.                     FILTER(?nat = 'CH') .
> 11.             }
> 12.             GROUP BY ?topic
> 13.     }
> 14.     {
> 15.             SELECT ?auth (COUNT(?book) AS ?numberOfItalianAuthors)
> 16.             WHERE {
> 17.                     ?auth :wrote ?book .
> 18.                     ?book :topic ?topic .
> 19.                     ?auth :hasNationality ?nat .
> 20.                     FILTER(?nat = 'IT') .
> 21.             }
> 22.             GROUP BY ?topic    23.     }
> 24.     FILTER(?numberOfItalianAuthors>?numberOfSwissAuthors)    25. }
> 
> SPARQL 1.1 version that uses the HAVING clause (see line 23):
> 
> 1. SELECT ?topic ?numberOfSwissAuthors ?numberOfItalianAuthors
> 2. WHERE {
> 3.     ?auth :hasName ?name .
> 4.     {
> 5.             SELECT ?auth (COUNT(?book) AS ?numberOfSwissAuthors)
> 6.             WHERE {
> 7.                     ?auth :wrote ?book .
> 8.                     ?book :topic ?topic .
> 9.                     ?auth :hasNationality ?nat .
> 10.                     FILTER(?nat = 'CH') .
> 11.             }
> 12.             GROUP BY ?topic
> 13.     }
> 14.     {
> 15.             SELECT ?auth (COUNT(?book) AS ?numberOfItalianAuthors)
> 16.             WHERE {
> 17.                     ?auth :wrote ?book .
> 18.                     ?book :topic ?topic .
> 19.                     ?auth :hasNationality ?nat .
> 20.                     FILTER(?nat = 'IT') .
> 21.             }
> 22.             GROUP BY ?topic  23.            
> HAVING(?numberOfItalianAuthors>?numberOfSwissAuthors)
> 24.     }   25. }
> 
> Therefore, we believe that there are queries that can be expressed in
> C-SPARQL but not in SPARQL 1.1.
> 
> == Computing Multiple Independent Aggregates At The Same Time ==
> 
> As we explained in the introduction, C-SPARQL was explicitly designed
> for processing RDF streams. The transient nature of streams poses the
> requirement to compute multiple (possibly independent) aggregates at the
> same time in the same query, because assuring that two independent
> queries process exactly the same data is very difficult. Therefore,
> multiple independent aggregations are also allowed within the same
> C-SPARQL query, with different grouping criteria and different
> partitions over the same set of bindings, thus pushing the aggregation
> capabilities beyond those of SQL.
> 
> The following query counts the number of books written by an author,
> counts the number of authors per book and returns the name, the book,
> the number of books and the number of authors.
> 
> SELECT ?name ?book ?numberOfBooks ?numberOfAuthors
> WHERE {
>    ?auth :name ?name .
>    ?auth :wrote ?book .
> }
> AGGREGATE { (?numberOfBooks, COUNT, {?auth} ) }
> AGGREGATE { (?numberOfAuthors, COUNT, {?book} ) }
> 
> Results:
> 
> ?name       | ?book | ?numberOfBooks | ?numberOfAuthors
> ---------------------------------------------------------
> "Alice Foo" |    b1 |            "2" |              "1"
> "Alice Foo" |    b2 |            "2" |              "2"
> "Bob   Bar" |    b2 |            "1" |              "2"
> 
> Judging from SPARQL 1.1 draft, the query above can be expressed in
> SPARQL 1.1 in the following way:
> 
> SELECT ?name ?surname ?book ?numberOfBooks ?numberOfAuthors
> WHERE {
>     ?auth :hasSurname ?surname .
>     ?auth :hasName ?name .
>     {
>             SELECT ?auth (COUNT(?book) AS ?numberOfBooks)
>             WHERE {
>                     ?auth :wrote ?book .
>             }
>             GROUP BY ?auth
>     }
>     {
>             SELECT  ?auth (COUNT(?auth) AS ?numberOfAuthors)
>             WHERE {
>                    ?auth :wrote ?book .
>             }
>             GROUP BY ?book
>     }
> }
> 
> == Conclusion ==
> 
> Clearly, the C-SPARQL notation is less cumbersome and more concise than
> the SPARQL 1.1 one. It is therefore, easier to express and understand
> aggregates in C-SPARQL than in the current draft of SPARQL 1.1, which we
> believe is an important factor in language design and adoption.
> 
> == References ==
> 
> [1] http://dbgroup.elet.polimi.it/
> [2] http://wiki.larkc.eu/c-sparql
> [3]
> http://www.larkc.eu/wp-content/uploads/2008/01/larkc_d33-description-of-strategy-and-design-for-data-stream-management-approaches_final.pdf
> 
> 
> Axel Polleres ha scritto:
> > The SPARQL WG has published  a First Public Working Draft "SPARQL 1.1
> > Property Paths"
> > (http://www.w3.org/TR/2010/WD-sparql11-property-paths-20100126/) which
> > defines a more succinct way to write parts of basic graph patterns and
> > also extend matching of triple pattern to arbitrary length paths. The
> > group also published six updates listed below:
> >
> > * SPARQL 1.1 Query (new working draft:
> > http://www.w3.org/TR/2010/WD-sparql11-query-20010126/) - Adds support
> > for aggregates, subqueries, projected expressions, and negation to the
> > SPARQL query language.
> > * SPARQL 1.1 Update (new working draft:
> > http://www.w3.org/TR/2010/WD-sparql11-update-20010126/) - - Defines an
> > update language for RDF graphs.
> > * SPARQL 1.1 Protocol for RDF (new working draft:
> > http://www.w3.org/TR/2010/WD-sparql11-protocol-20010126/) - Defines an
> > abstract interface and HTTP bindings for a protocol to issue SPARQL
> > Query and SPARQL Update statements against a SPARQL endpoint.
> > * SPARQL 1.1 Service Description (new working draft:
> > http://www.w3.org/TR/2010/WD-sparql11-service-description-20010126/) -
> > Defines a vocabulary and discovery mechanism for describing the
> > capabilities of a SPARQL endpoint.
> > * SPARQL 1.1 Uniform HTTP Protocol for Managing RDF Graphs (new
> > working draft:
> > http://www.w3.org/TR/2010/WD-sparql11-http-rdf-update-20010126/) -
> > Describes the use of the HTTP protocol for managing named RDF graphs
> > on an HTTP server.
> > * SPARQL 1.1 Entailment Regimes (new working draft:
> > http://www.w3.org/TR/2010/WD-sparql11-entailment-20010126/) - Defines
> > conditions under which SPARQL queries can be used with entailment
> > regimes such as RDF, RDF Schema, OWL, or RIF.
> >
> > The working group is seeking feedback on these drafts from, especially
> > on those points currently marked with ISSUEs currently under
> > discussion, cf. [1]. If you wish to make comments regarding these
> > documents, please send them to
> >     public-rdf-dawg-comments@w3.org
> >
> > with best regards,
> > Axel Polleres & Lee Feigenbaum (on behalf of the SPARQL WG)
> >
> > 1. http://www.w3.org/2009/sparql/track/issues/open
> >  
> 
> 
> 
> 
>
Received on Tuesday, 16 February 2010 16:28:06 UTC