- From: Jeen Broekstra <jeen.broekstra@gmail.com>
- Date: Sat, 05 Feb 2011 17:41:46 +1300
- To: public-rdf-dawg-comments@w3.org
(This is a shortened version of a weblog posting I placed on my personal
weblog[1]. I am sending this to the DAWG as feedback on the Oct 14 2010
working draft, in the hope it will be useful)
I am currently in the middle of implementing SPARQL 1.1 Query Language
into Sesame 2. The current working draft (October 14 2010) specifies a
number of new features for SPARQL, and I will briefly make some points
about some of the features I have implemented thus far, noting issues I
encountered or where the current working draft was unclear to me.
1. Negation
In section 8 two additional operators are introduced, both of which can
be used to express negation. They are (NOT) EXISTS, and MINUS.
Implementation of the EXISTS function was quite straightforward, Sesame
already having algebraic support for it.
The definition of MINUS in SPARQL gave me some headaches, however (note
to DAWG: I sent an e-mail about this earlier, you may consider that mail
to be superceded by this mail).
In Sesame's native query language SeRQL, the MINUS operator is a set
operator operating on collections of triples - that is, the result of
{A} MINUS {B} is the set of all triples matching A, minus all triples
matching B. In SPARQL, however, MINUS is defined in terms of compatible
solutions. This means that Sesame's own algebra operator for MINUS can
not simply be reused for SPARQL. However, it also seems that SPARQL's
definition of MINUS makes it, for all practical purposes, exactly
equivalent to using a NOT EXISTS filter . To see why this is, we have to
take a look at the definitions of both operators.
In section 8.3 , the difference between NOT EXISTS and MINUS is
explained, with a number of examples. This explanation shows that when
the right-hand side pattern shares no variables with the left-hand
pattern, the outcome is different. However, what is also apparent from
this explanation that when a MINUS operator is used and no shared
variables exist between the two patterns, the MINUS operator effectively
does nothing.
This also follows if we look at the definition of MINUS in the SPARQL
algebra and the definition of compatible solutions in section 17.3: by
definition any two solutions µ and µ' which share no variables v are
compatible. So the outcome of any such query would be exactly the same
as if the MINUS were not there. This leaves us with two scenarios:
1. the two patterns share a variable, in this case the MINUS can be
replaced with a NOT EXISTS;
2. the two patterns do not share a variable, in this case the MINUS
can be ignored.
All in all it seems to me that MINUS as currently defined does not add
additional expressivity to the language and is only a syntactic variant.
I would like to know from the working group if my understanding is
correct, if so if this is intentionally designed as such, and would
recommend this to be clarified in the working draft.
2. Aggregates
There are a number of things unclear to me in the working draft
regarding the expected behaviour of aggregate functions.
The first problem has to do with datatypes. Most examples take it as
given that all input to, say, a SUM operator will be numeric values. It
is not clearly stated what the expected behaviour is if a particular
variable binding turns out to be non-numeric value. As a case in point,
SUM is formally defined in terms of the XPath function op:numeric-add.
This function's definition explicitly states that it operates only on
specific numeric types. No mention is made however, of expected
behaviour when one operand is not a numeric type (section 16.3 of the
SPARQL WD does mention that a type error results for incompatible
operands, but it is not clear if this also applies to aggregate
operators). Moreover, it is not clearly stated how a type error in an
aggregate function should influence the result. I can see a couple of
possible scenarios, when a type error occurs during evaluation of an
aggregate function:
1. the entire query fails with an error;
2. the incompatible operand value is ignored and evaluation continues;
3. the aggregate operator fails silently, returning 0.
From a usability perspective, I would probably have a preference for
option 2, although I note that in other mathematical operators (+, -, *,
etc.) Sesame's interpretation currently is that an incompatible operand
results in a failed query. In any case, I would recommend that the
working group adds an explanation on expected behaviour in such cases.
Another problem with the definition of aggregates, or more in general
with aggregates in combination with several other features, is that it
is not always clearly defined how they should interact (disclaimer:
perhaps it is properly defined in section 10.2, but I'm having a hard
time following the definitions there). For example, what happens when we
apply an ORDER BY on a graph pattern that already has a GROUP BY and an
aggregate function? To illustrate, take the following data set:
:org1 :affiliates :auth1, :auth2 .
:auth1 :name "John" .
:auth2 :name "Paul" .
:org2 :affiliates :auth3 .
:auth3 :name "Ringo" .
And the following query:
SELECT (GROUP_CONCAT(?name) AS ?names)
WHERE {
?org :affiliates ?auth .
?auth :name ?name.
}
GROUP BY ?org
ORDER BY ASC(str(?name))
My intuitive understanding would be that the result of this query would be:
?names
"Ringo"
"John Paul"
That is: the ordering is applied to the intermediate result of the
grouping, thus supplying the aggregate operator (in this case,
GROUP_CONCAT) with an ordered sequence (which makes sure that we get a
concatenated string "John Paul" rather than "Paul John"). But it is not
completely clear to me from the working draft if the ORDER BY clause
should be applied to a grouping in this fashion. I would like to know
from the working group whether my understanding is correct, and would
also recommend that an explanation is added on how the various clauses
and operators interact (especially any combinations with aggregates).
These are my findings thus far. I have not yet started on property paths
or federated query. In the meantime, I would welcome any feedback on my
notes, including feedback that tells me I should have read section
so-and-so and it's all clear as glass if I had just taken the time to
study it properly :)
Also, this: in the course of this work I have written several
DAWG-Manifest style unit tests to check conformance as I saw it. They
can be found in Sesame's SVN repository, and I'd be happy to let them be
reused.
Regards,
Jeen Broekstra
[1]
http://jeenbroekstra.blogspot.com/2011/02/implementing-sparql-11-query-first.html
Received on Saturday, 5 February 2011 04:42:24 UTC