Issues with evaluating optional: Commutativity of AND from Bijan Parsia on 2006-10-10 (public-rdf-dawg@w3.org from October to December 2006)

From: Bijan Parsia <bparsia@cs.man.ac.uk>
Date: Tue, 10 Oct 2006 09:08:25 +0100
To: RDF Data Access Working Group <public-rdf-dawg@w3.org>
Message-Id: <C15A6413-4D44-4B47-B13D-FB809E834334@cs.man.ac.uk>
I'm following:
	http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2006Apr/ 
0000.html
	http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2006Jun/ 
0008.html

And section 4 of the Semantics and Complexity of SPARQL (SCS).

This may correspond with some of Fred's issues and I apologize for  
not reconciling with his texts and email.

Sorry about length. I'm trying to be clear but still cutting out a lot.
=================================
Some high level points:

Point 1:
	SCS (and the two email above) assert that it is strongly desirable  
that the semantics of the SPARQL algebra have normal features one  
expects, e.g., compositional semantics, commutativity of conjunction,  
etc. I agree with these. They are very important for optimization,  
for example, and for end users.

Point 2:
	There are possible *ways of evaluating a SPARQL parse tree (or  
abstract syntax tree)* that do not support these features (as shown  
in SCS). The way to evaluate a SPARQL parse tree (oh, let's call it a  
"query plan" ok?) is not clear, from what I can tell, in the current  
spec. 0000.html suggests this. We should definitely *nail down* a  
semantics for the evaluation of query plans that is *unambiguous*. I  
would prefer that it met the criteria in point 1 as well.

Point 3:
	There are related issues with the scope of variables. (I think.)

=====
Here's a little explication of proposition of SCS. None of the  
content is my work.

In SCS, there is defined what I shall call UNION-normal form. That  
is, query plans may be normalized to a UNION of UNION free query  
plans. Obviously this relies on the distributivity of UNION over all  
the other operators. Drawing from SCS, proposition 1, we see, for  
example, for query plans P1, P2, and P3:

	(P1 OPT (P2 UNION P3)) <==> ((P1 OPT P2) UNION (P1 OPT P3))

The rest of proposition 1 (aside from (1) which makes AND and UNION  
associative and communtative) states similar equivalences.

One nice thing about this is it allows us to focus on *UNION-free*  
query plans (UFQ) ...since we can always push the UNIONs "outward".  
(The commutativity of UNION means we can evaluate the component query  
plans in any order, as one should expect)

I think that that proposition 1 should hold. We should want these  
equivalences and other properties.

======
This is an explication of section 4 of SCS, specifically 4.1. Still  
not my work :)

SCS claims that there are two *semantically* distinct ways of  
evaluating UFQs: One that obeys Proposition 1, and one which does not  
(the latter they observe as the behavior of ARQ). I'm going to call  
these E1 (for the compositional evaluation) and E2 (for the ARQ/ 
operational/navigational/depth-first way). (I'm *so* good at naming.)

For each case, let's ignore FILTER and focus on OPT (and obviously  
AND :)).

	For E1, I refer to the formal semantics given in SCS. Bascially, AND  
is a join, and OPT is a left-outer-join.

	The paper even defines these for ya! (Terminology point: think of  
"mapping" as being an answer (i.e., set of bindings). "compatibility"  
just means that two answers/rows that you are, e.g., joining have the  
same binding for shared variables.)

	For E2, things are a bit more complicated to state. E2 is a *depth  
first* evaluation of a plan. So, let's consider the following plan:
	((BGP1 AND BGP2) OPT BGP3)

	Parse tree:
	OPT
		AND
			BGP1
			BGP2
		BGP3

So, we go deep, the first directly evaluable node is BGP1, then BGP2,  
then the AND, then BGP3 (sibling of the AND) then the OPT (the OPT is  
tricky). Let E2 be the evaluation operator with an implicit Dataset  
argument and two arguments: A pattern to be evaluated, and a current  
set of results): So, BGP1 is evaluated against the empty set, E2 
(BGP1, {}). BGP2 is evaluated against those results, etc., i.e.:
	E2(BGP2, E2(BGP1, {})

(Let's call this A1).

Ok, so far so good. But how to evaluate the optional? I shall use J  
for join, and LOJ for left outer join.

The intution behind E2 is that for OPT(P1, P2), *first* you get the  
answers for P1, then check which of the answers may be extended with  
binding for P2. Thus, for our OPT(R1, BGP3), we perform
	LOJ(A1, E2(BGP3, A1))

Sigh. not so clear, but let's compare it to a direct LOJ reading of  
OPT/J reading of And:
	LOJ(J(BGP1, BGP2), BGP3)

Their examples are a bit different, but let's consider example 4,  
which shows that on E2 AND is not (always) communitative:

	P1 = ((?X :name :paul) AND ((?Y :name :george) OPT (?X :email ?Z)))
	P2 = (((?Y :name :george) OPT (?X :email ?Z)) AND (?X :name :paul))

Data:
	_:B1 :name :paul.
	_:B3 :name :george.

So, let
	T1 = (?X :name :paul)
	T2 = (?Y :name :george)
	T3 = (?X :email ?Z)
we have:
	P1 = (T1 AND (T2 OPT T3))
	P2 = ((T2 OPT T3) and T1)

By E1 evaluation, we have:
	E1(P1) = J(T1, LOJ(T2, T3))
	E1(P2) = J(LOJ(T2, T3), T1)

E2 goes like (and this is killing me):
	E2(P1) = E2(LOJ(
					E2(T2, E2(T1, {}),
					E2(T3, E2(T2, E2(T1,{}))),
				E2(T1, {}))
	E2(P2) = E2(T1,
			LOJ(E2(T2, {}),
				E2(T3, E2(T2, {})))

Ok, so, what are the results. For the E1 cases, we get no results  
because
	LOJ(T1, T2)
yields
	?Y		?X	?Z
	_:B3	UB	UB
(I.e., ?x, and ?z are unbound)

But then
	J(T1, LOJ(T1, T2))
is empty since _:B1 and UB don't match.

But E2(P1) yields:
	?X		?y		?Z
	_:B1	_:B3	UB

Whereas E2(P2) give no answers.

Sigh. Now, there's a quite natural *mis*reading where E2(P1) gives us  
the right answers. If we made UB match anything, I think we'd get the  
right answers all around for this query (but spooooooky). However,  
losing commutativity of AND is a high price to pay.

I say misreading because SCS points out that there's something fishy  
about the way the variables are arranged. It seems like T3,  
positionwise, is giving optional info about george, but, variablewise  
it's about paul. They forbid such variable jumping in one  
restriction, and when you do that E1 and E2 coincide.

I *think* that's my preference. The above is a stupid query, very  
counterintuitive. I personally would still like to just the  
compositional semantics explicitly, but I think forbiding such "ill- 
defined" queries is also a very good idea.

Whew. Only 00:40 here.

I want to re-re-reemphasize again and again that I'm totally  
regurgitating the published work of Jorge Perez, Marcelo Arenas, and  
Claudio Gutierrez. This example conclusively, to my mind, shows that  
we need a clear and formal specification of the algebra. As an  
implementor, I find the normal form, and the various properties  
extremely useful. The direct mapping to Join and Left Outer Join is  
very nice (even if you don't use SQL for BGP matching, it's nice to  
be able to use a query planner with what one might call "virtual  
tables").

The current formal definition of OPTIONAL isn't very formal or very  
definitional. Let's take the first  clause:

"""An optional graph pattern is a combination of a pair of graph  
patterns. The second pattern modifies pattern solutions of the first  
pattern but does not fail matching of the overall optional graph  
pattern."""

The first sentence is a bit vague (I would prefer something like "an  
ordered pair of graph patterns", if we continue with the non-operator  
oriented approach we see in UNION). The second is, on the one hand,  
written like informative text, but, on the other, suggests that E2 is  
the mode of evaluation.

"""If Opt(A, B) is an optional graph pattern, where A and B are graph  
patterns, then S is a solution of optional graph pattern if S is a  
pattern solution of A and of B otherwise if S is a solution to A, but  
not to A and B."""

This seems quite close to the definition of left outer join. However,  
I don't see any mention of unboundedness, so it's a little weird.

(Thanks to Jorge for reviewing this message and helping in general.  
There are other issues (sigh), but other posts, eh?)

Cheers,.
Biijan.
Received on Tuesday, 10 October 2006 08:08:43 UTC