- From: Jeen Broekstra <jeen.broekstra@gmail.com>
- Date: Sat, 05 Feb 2011 17:41:46 +1300
- To: public-rdf-dawg-comments@w3.org
(This is a shortened version of a weblog posting I placed on my personal weblog[1]. I am sending this to the DAWG as feedback on the Oct 14 2010 working draft, in the hope it will be useful.)

I am currently in the middle of implementing the SPARQL 1.1 Query Language in Sesame 2. The current working draft (October 14 2010) specifies a number of new features for SPARQL, and I will briefly make some points about the features I have implemented thus far, noting issues I encountered or places where the current working draft was unclear to me.

1. Negation

In section 8 two additional operators are introduced, both of which can be used to express negation. They are (NOT) EXISTS and MINUS.

Implementation of the EXISTS function was quite straightforward, Sesame already having algebraic support for it. The definition of MINUS in SPARQL gave me some headaches, however (note to DAWG: I sent an e-mail about this earlier; you may consider that mail to be superseded by this mail).

In Sesame's native query language SeRQL, the MINUS operator is a set operator operating on collections of triples - that is, the result of {A} MINUS {B} is the set of all triples matching A, minus all triples matching B. In SPARQL, however, MINUS is defined in terms of compatible solutions. This means that Sesame's own algebra operator for MINUS cannot simply be reused for SPARQL.

However, it also seems that SPARQL's definition of MINUS makes it, for all practical purposes, exactly equivalent to using a NOT EXISTS filter. To see why this is, we have to take a look at the definitions of both operators.

In section 8.3, the difference between NOT EXISTS and MINUS is explained, with a number of examples. This explanation shows that when the right-hand side pattern shares no variables with the left-hand pattern, the outcome is different.
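
As a rough illustration of the behaviour discussed in this section, here is a minimal Python sketch. It assumes solutions are modelled as plain dictionaries, and it approximates NOT EXISTS by testing against the solutions of the right-hand pattern (the working draft defines it through substitution, so this ignores cases such as filters inside the pattern). The function names `compatible`, `not_exists` and `minus` are hypothetical and are not taken from Sesame or from the working draft.

```python
def compatible(mu1, mu2):
    """Two solutions are compatible if they agree on every shared variable."""
    return all(mu1[v] == mu2[v] for v in mu1.keys() & mu2.keys())


def not_exists(left, right):
    """Keep a left solution only if no right solution is compatible with it.

    Assumption: a simplified model of FILTER NOT EXISTS for plain patterns.
    """
    return [mu for mu in left if not any(compatible(mu, nu) for nu in right)]


def minus(left, right):
    """Keep a left solution unless a compatible right solution shares a variable.

    Assumption: encodes the reading in this mail, where MINUS removes
    nothing when the two patterns have no variables in common.
    """
    return [
        mu for mu in left
        if not any(compatible(mu, nu) and bool(mu.keys() & nu.keys())
                   for nu in right)
    ]


left = [{"s": "a"}, {"s": "b"}]
right_disjoint = [{"x": "a"}]   # no variable in common with the left side
right_shared = [{"s": "a"}]     # shares the variable s

# No shared variables: the two operators give different outcomes.
minus_disjoint = minus(left, right_disjoint)            # left is unchanged
not_exists_disjoint = not_exists(left, right_disjoint)  # everything is removed

# Shared variable: both operators remove the same solution.
minus_shared = minus(left, right_shared)
not_exists_shared = not_exists(left, right_shared)
```

In this simplified model the two operators agree whenever the patterns share a variable and differ only in the disjoint case.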
However, what is also apparent from this explanation is that when a MINUS operator is used and no shared variables exist between the two patterns, the MINUS operator effectively does nothing. This also follows if we look at the definition of MINUS in the SPARQL algebra and the definition of compatible solutions in section 17.3: by definition any two solutions µ and µ' which share no variables v are compatible. So the outcome of any such query would be exactly the same as if the MINUS were not there.

This leaves us with two scenarios:

1. the two patterns share a variable; in this case the MINUS can be replaced with a NOT EXISTS;
2. the two patterns do not share a variable; in this case the MINUS can be ignored.

All in all it seems to me that MINUS as currently defined does not add expressivity to the language and is only a syntactic variant. I would like to know from the working group whether my understanding is correct and, if so, whether this is intentional. I would recommend that this be clarified in the working draft.

2. Aggregates

There are a number of things unclear to me in the working draft regarding the expected behaviour of aggregate functions.

The first problem has to do with datatypes. Most examples take it as given that all input to, say, a SUM operator will be numeric values. It is not clearly stated what the expected behaviour is if a particular variable binding turns out to be a non-numeric value.

As a case in point, SUM is formally defined in terms of the XPath function op:numeric-add. This function's definition explicitly states that it operates only on specific numeric types. No mention is made, however, of the expected behaviour when one operand is not a numeric type (section 16.3 of the SPARQL WD does mention that a type error results for incompatible operands, but it is not clear if this also applies to aggregate operators).

Moreover, it is not clearly stated how a type error in an aggregate function should influence the result.
I can see a couple of possible scenarios when a type error occurs during evaluation of an aggregate function:

1. the entire query fails with an error;
2. the incompatible operand value is ignored and evaluation continues;
3. the aggregate operator fails silently, returning 0.

From a usability perspective, I would probably have a preference for option 2, although I note that for other mathematical operators (+, -, *, etc.) Sesame's interpretation currently is that an incompatible operand results in a failed query. In any case, I would recommend that the working group add an explanation of the expected behaviour in such cases.

Another problem with the definition of aggregates, or more generally with aggregates in combination with several other features, is that it is not always clearly defined how they should interact (disclaimer: perhaps it is properly defined in section 10.2, but I'm having a hard time following the definitions there).

For example, what happens when we apply an ORDER BY on a graph pattern that already has a GROUP BY and an aggregate function? To illustrate, take the following data set:

    :org1 :affiliates :auth1, :auth2 .
    :auth1 :name "John" .
    :auth2 :name "Paul" .
    :org2 :affiliates :auth3 .
    :auth3 :name "Ringo" .

And the following query:

    SELECT (GROUP_CONCAT(?name) AS ?names)
    WHERE {
      ?org :affiliates ?auth .
      ?auth :name ?name .
    }
    GROUP BY ?org
    ORDER BY ASC(str(?name))

My intuitive understanding would be that the result of this query would be:

    ?names
    "Ringo"
    "John Paul"

That is: the ordering is applied to the intermediate result of the grouping, thus supplying the aggregate operator (in this case, GROUP_CONCAT) with an ordered sequence (which makes sure that we get a concatenated string "John Paul" rather than "Paul John"). But it is not completely clear to me from the working draft if the ORDER BY clause should be applied to a grouping in this fashion.
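
The interpretation described above can be sketched in a few lines of Python. This is only an illustration of the reading in this mail, not of what the working draft mandates: the helper `group_concat` and its `order_first` flag are hypothetical, and the solutions are hard-coded (org, name) pairs standing in for the result of the WHERE clause.

```python
from collections import defaultdict

# Solutions of the WHERE clause as (org, name) pairs, deliberately unordered.
solutions = [
    ("org1", "Paul"),
    ("org2", "Ringo"),
    ("org1", "John"),
]


def group_concat(rows, order_first, separator=" "):
    """Group rows by org and concatenate the names within each group.

    If order_first is True, rows are sorted by name before grouping,
    mirroring the reading in which ORDER BY feeds the aggregate an
    ordered sequence. Otherwise the input order is used as-is.
    """
    if order_first:
        rows = sorted(rows, key=lambda row: row[1])
    groups = defaultdict(list)
    for org, name in rows:
        groups[org].append(name)
    return {org: separator.join(names) for org, names in groups.items()}


ordered = group_concat(solutions, order_first=True)
unordered = group_concat(solutions, order_first=False)

print(ordered["org1"])    # John Paul
print(unordered["org1"])  # Paul John
```

Without the ordering step, the concatenated value depends on whatever order the solutions happen to arrive in, which is exactly the ambiguity in question.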
I would like to know from the working group whether my understanding is correct, and would also recommend that an explanation be added on how the various clauses and operators interact (especially any combinations with aggregates).

These are my findings thus far. I have not yet started on property paths or federated query. In the meantime, I would welcome any feedback on my notes, including feedback that tells me I should have read section so-and-so and it's all clear as glass if I had just taken the time to study it properly :)

Also, this: in the course of this work I have written several DAWG-Manifest style unit tests to check conformance as I saw it. They can be found in Sesame's SVN repository, and I'd be happy to let them be reused.

Regards,

Jeen Broekstra

[1] http://jeenbroekstra.blogspot.com/2011/02/implementing-sparql-11-query-first.html
Received on Saturday, 5 February 2011 04:42:24 UTC