Implementing SPARQL 1.1 Query - feedback on Oct 14 WD from Jeen Broekstra on 2011-02-05 (public-rdf-dawg-comments@w3.org from February 2011)

From: Jeen Broekstra <jeen.broekstra@gmail.com>
Date: Sat, 05 Feb 2011 17:41:46 +1300
To: public-rdf-dawg-comments@w3.org
Message-ID: <4D4CD50A.60108@gmail.com>
(This is a shortened version of a weblog posting I placed on my personal 
weblog[1]. I am sending this to the DAWG as feedback on the Oct 14 2010 
working draft, in the hope it will be useful)

I am currently in the middle of implementing SPARQL 1.1 Query Language 
into Sesame 2. The current working draft (October 14 2010) specifies a 
number of new features for SPARQL, and I will briefly make some points 
about some of the features I have implemented thus far, noting issues I 
encountered or where the current working draft was unclear to me.

1. Negation

In section 8 two additional operators are introduced, both of which can 
be used to express negation. They are (NOT) EXISTS, and MINUS. 
Implementation of the EXISTS function was quite straightforward, Sesame 
already having algebraic support for it.

The definition of MINUS in SPARQL gave me some headaches, however (note 
to DAWG: I sent an e-mail about this earlier, you may consider that mail 
to be superceded by this mail).

In Sesame's native query language SeRQL, the MINUS operator is a set 
operator operating on collections of triples - that is, the result of 
{A} MINUS {B} is the set of all triples matching A, minus all triples 
matching B. In SPARQL, however, MINUS is defined in terms of  compatible 
solutions. This means that Sesame's own algebra operator for MINUS can 
not simply be reused for SPARQL. However, it also seems that SPARQL's 
definition of MINUS makes it, for all practical purposes, exactly 
equivalent to using a NOT EXISTS filter . To see why this is, we have to 
take a look at the definitions of both operators.

In section 8.3 , the difference between NOT EXISTS and MINUS is 
explained, with a number of examples. This explanation shows that when 
the right-hand side pattern shares no variables with the left-hand 
pattern, the outcome is different. However, what is also apparent from 
this explanation that when a MINUS operator is used and no shared 
variables exist between the two patterns, the MINUS operator effectively 
does nothing.

This also follows if we look at the definition of MINUS in the SPARQL 
algebra and the definition of compatible solutions in section 17.3: by 
definition any two solutions µ and µ' which share no variables v are 
compatible. So the outcome of any such query would be exactly the same 
as if the MINUS were not there. This leaves us with two scenarios:

    1. the two patterns share a variable, in this case the MINUS can be 
replaced with a NOT EXISTS;
    2. the two patterns do not share a variable, in this case the MINUS 
can be ignored.

All in all it seems to me that MINUS as currently defined does not add 
additional expressivity to the language and is only a syntactic variant. 
I would like to know from the working group if my understanding is 
correct, if so if this is intentionally designed as such, and would 
recommend this to be clarified in the working draft.

2. Aggregates

There are a number of things unclear to me in the working draft 
regarding the expected behaviour of aggregate functions.

The first problem has to do with datatypes. Most examples take it as 
given that all input to, say, a SUM operator will be numeric values. It 
is not clearly stated what the expected behaviour is if a particular 
variable binding turns out to be non-numeric value. As a case in point, 
SUM is formally defined in terms of the XPath function op:numeric-add. 
This function's definition explicitly states that it operates only on 
specific numeric types. No mention is made however, of expected 
behaviour when one operand is not a numeric type (section 16.3 of the 
SPARQL WD does mention that a type error results for incompatible 
operands, but it is not clear if this also applies to aggregate 
operators). Moreover, it is not clearly stated how a type error in an 
aggregate function should influence the result. I can see a couple of 
possible scenarios, when a type error occurs during evaluation of an 
aggregate function:

    1. the entire query fails with an error;
    2. the incompatible operand value is ignored and evaluation continues;
    3. the aggregate operator fails silently, returning 0.

 From a usability perspective, I would probably have a preference for 
option 2, although I note that in other mathematical operators (+, -, *, 
etc.) Sesame's interpretation currently is that an incompatible operand 
results in a failed query. In any case, I would recommend that the 
working group adds an explanation on expected behaviour in such cases.

Another problem with the definition of aggregates, or more in general 
with aggregates in combination with several other features, is that it 
is not always clearly defined how they should interact (disclaimer: 
perhaps it is properly defined in section 10.2, but I'm having a hard 
time following the definitions there). For example, what happens when we 
apply an ORDER BY on a graph pattern that already has a GROUP BY and an 
aggregate function? To illustrate, take the following data set:

:org1 :affiliates :auth1, :auth2 .
:auth1 :name "John" .
:auth2 :name "Paul" .
:org2 :affiliates :auth3 .
:auth3 :name "Ringo" .

And the following query:

SELECT (GROUP_CONCAT(?name) AS ?names)
WHERE {
   ?org :affiliates ?auth .
   ?auth :name ?name.
}
GROUP BY ?org
ORDER BY ASC(str(?name))

My intuitive understanding would be that the result of this query would be:

?names
"Ringo"
"John Paul"

That is: the ordering is applied to the intermediate result of the 
grouping, thus supplying the aggregate operator (in this case, 
GROUP_CONCAT) with an ordered sequence (which makes sure that we get a 
concatenated string "John Paul" rather than "Paul John"). But it is not 
completely clear to me from the working draft if the ORDER BY clause 
should be applied to a grouping in this fashion. I would like to know 
from the working group whether my understanding is correct, and would 
also recommend that an explanation is added on how the various clauses 
and operators interact (especially any combinations with aggregates).

These are my findings thus far. I have not yet started on property paths 
or federated query. In the meantime, I would welcome any feedback on my 
notes, including feedback that tells me I should have read section 
so-and-so and it's all clear as glass if I had just taken the time to 
study it properly :)

Also, this: in the course of this work I have written several 
DAWG-Manifest style unit tests to check conformance as I saw it. They 
can be found in Sesame's SVN repository, and I'd be happy to let them be 
reused.

Regards,

Jeen Broekstra

[1] 
http://jeenbroekstra.blogspot.com/2011/02/implementing-sparql-11-query-first.html
Received on Saturday, 5 February 2011 04:42:24 UTC