A comment about the semantics of the Minus operator from the SPARQL specification

Hello,

I have a comment about the semantics of the Minus operator from the SPARQL specification. I stumbled upon this while trying to implement SPARQL 1.1 in the RDFox system that we’re developing at Oxford University. In Section 18.5 (http://www.w3.org/TR/sparql11-query/#sparqlAlgebra) of the SPARQL 1.1 recommendation, I see that the Minus operator is defined as follows:

Minus(Ω1, Ω2) = { μ | μ in Ω1 . ∀ μ' in Ω2, either μ and μ' are not compatible or dom(μ) and dom(μ') are disjoint }

I see an explanation below the operator’s definition for the second part of the condition (i.e., "or dom(μ) and dom(μ') are disjoint”), but this seems to me as a suboptimal choice for at least two reasons.



1. The addition of this condition breaks means that the following two patterns don’t have the same result:

(1)  { ?X rdf:type :A . OPTIONAL { ?Y rdf:type :B } }
(2)  { { ?X rdf:type :A . ?Y rdf:type :B } UNION { ?X rdf:type :A . MINUS { ?Y rdf:type :B } } }

The equivalence (A OPT B) = (A JOIN B) UNION (A MINUS B) seems natural and it holds already in SQL, so it may be unexpected for SPARQL to break this. Furthermore, query planning often depends on this equivalence, so the modified semantics of MINUS prevents the application of some well-known optimisation.



2. The present definition prevents an implementation approach that uses sideways information passing and iterators. To clarify this, consider the following pattern P, where SP1 and SP2 are some (unspecified) patterns:

(3)  P = { SP1 MINUS { SP2 } }

Now assume that we’ve got a function eval() that takes as input a pattern and that returns a set of mappings that constitute answers to the pattern. One way to implement eval(SP1 MINUS { SP2 }) is as follows:

R := empty set
For each mapping μ in eval(SP1)
    Let SP2’ be the result of applying μ to SP2 (i.e., we replace in SP2 each variable x with μ(x))
    If eval(SP2’) is empty then
        Add μ to R
return R

Evaluating the query in such a way might be better than using the usual bottom-up strategy. For example, if the size of eval(SP1) is small, whereas the size of eval(SP2) is very large but SP2 can be evaluated using indexes, then the above algorithm is much more efficient than the bottom-up one as it iterates over a small set and doesn’t require materialisation of a potentially very large set. Another benefit of this implementation strategy is that, depending on the structure of SP1 and SP2, we may not need to materialise the answers to either SP1 or SP2: if we use an iterator-based implementation (as is common in database systems), we may be able to evaluate SP1 on-the-fly, and we may also be able to check whether eval(SP2’) is empty on the fly as well.

Unfortunately, the algorithm I sketched is incorrect with respect to the semantics given in the document, and I see no way to make it correct: after applying μ to SP2, there is no way of checking the constraint "or dom(μ) and dom(μ') are disjoint” from the definition of Minus. Thus, if one is to fully obey the semantics from the specification, it seems to me that one is forced to fully evaluate SP2 so that one can later check the domain disjointness condition.



I understand that dropping the disjointness condition means that pattern { SP1 MINUS { } } has no answers regardless of the form of SP1. However, given the expected identities between JOIN, MINUS, and OPTIONAL, one might actually expect this; furthermore, I doubt that there would be many cases in practice where this effect would be observable; and finally I believe that the resulting specification would lend itself to better implementation. I understand that it might be too late to change the specification now, and I am writing this e-mail for the following reasons:

- I would like to log a technical comment that might be taken into account in a future edition of the specification.
- I am wondering whether WG members have anything to add to my comment.
- I am wondering whether it might be possible to add somewhere (e.g., in the errata document) a comment that would allow SPARQL vendors to use the modified semantics of MINUS, but with a lesser degree of compatibility with the specification.

Regards,

Boris

Received on Friday, 18 July 2014 15:08:05 UTC