bnodeification (was: Re: SPARQL semantics: open issues for basic query patterns)

I'm separating this into two threads, (this) one concerned with 
tweaking the definitions to handle bnodes properly, the other to do 
with wording for an 'entailment' parameter, based on Enrico's draft. 
I'm pretty sure these issues are orthogonal, provided that the 
definitions refer centrally to 'entails', which can either be the 
parameter or can be understood as 'simply entails' if we decide not 
to have an entailment parameter. So I will say 'entails' here without 
commitment one way or the other.

So, I learned last week that bnodes in answers are scoped to the 
answer SET, in effect, rather than to individual answers: to the 
table rather than to lines in the table. Sorry it took me so long to 
grok that. This is great, because this, plus our decision to not have 
'told bnodes', means that we can tie the definitions even closer to 
entailment. The trick for handling told bnodes was to put (a copy of) 
the graph into the consequent along with the answer instance, forcing 
them to 'share' bnodes properly. The trick for making the bnodes in 
the answers behave properly is to put ALL the answers instances into 
the consequent, which works similarly. So, all the definitions apply 
to sets of answer bindings rather than individual bindings, call this 
a binding set. We need to make one rather awkward definition, and its 
awkwardness arises from the fact that we allow bnodes in query 
patterns and we want those bnodes to be scoped to the query pattern, 
whereas we want the bnodes in the answer bindings to be scoped to the 
entire answer set. Once that is done, the other definitions are as 
clear as day.

Define the ordered merge of an RDF graph G and a SPARQL graph pattern 
P to be a pattern containing all the triples in G and M(P), where M 
is a 1:1 mapping on bnodes which replaces any bnode in P which also 
occurs in G by a new bnode not in G. If there are no such 'shared' 
bnodes then M is the identity mapping and the merge is simply the 
union of all the triples in G and P.
(This is Enrico's "RDF-merge", but uses the fact that is defined on 
patterns to be an opportunity to be technically a new definition. I 
think its slightly misleading to call it RDF-merge since technically 
its not quite the same definition as RDF merge.)

This definition should be understood to apply even when P contains no 
variables: in that case, this definition is essentially <a 
href="http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/#section-graph-equality 
">equivalent</a> to that of RDF graph <a 
href="http://www.w3.org/TR/rdf-mt/#graphdefs">merge</a>.

Given a binding set S for a query Q, the answer graph S(Q) is, 
intuitively, the RDF graph formed by merging together one instance of 
the query Q under each binding B in S, while maintaining the identity 
of any bnodes which are introduced by the bindings themselves. More 
precisely, consider the result A<sub>N</sub> of iterating the 
following process over the bindings B<sub>1</sub> ... B<sub>N</sub> 
in S, starting with A<sub>0</sub> as the empty graph:

A<sub>n</sub> := B<sub>n</sub>(A<sub>n-1</sub> ordered-merge Q)

and define S(Q) to be  A<sub>N</sub>.

It is easy to see that the choice of ordering of S will affect only 
the particular bnodeIDs which occur in the final graph, so that the 
particular ordering can be chosen arbitrarily.

Since all the A<sub>n</sub> are RDF graphs which contain no 
variables, applying the binding mapping after the merge has the 
result required, of treating bnodes in the query pattern as scoped to 
the pattern; but all bnodes in answer bindings share a single scope, 
since once introduced into the answer graph they retain their 
identity through subsequent ordered merges.

(If the query contains no bnodes, then there is a much simpler way to 
define S(Q), as simply being the union (not merge) of all the query 
instances, i.e. the RDF graph {B(T): B in S and T in Q}. Another way 
to define S(Q) is to imagine that every bnode in Q is replaced by a 
distinct query variable (which cannot be SELECTed, however) and then 
use this definition. There are cases in which this definition is 
slightly tighter than the previous one, eliminating some redundancy. 
Also this doesn't need the ordered-merge idea.)

OK, now we have that over and done with, it all gets much easier:

A binding set S is correct when G entails S(Q)

A binding set S subsumes (we could say 'entails', but it might be 
confusing(?)) another binding set T when S(Q) entails T(Q).

The full answer set is the maximal possible correct set, i.e. all 
possible entailed instances. (This always exists and is unique, but 
it is infinite: it's purely a theoretical construct.)

An answer set is any correct set which subsumes the full answer set.

An answer set is redundant if it is subsumed by a proper subset of 
itself; otherwise it is lean.

OK, that is all. SPARQL servers MUST give only correct binding sets. 
If the number of answers is not restricted in the query then they 
MUST give an answer set. They SHOULD NOT introduce redundancies in 
answer sets. This allows redundant answers: see below for more on 
this.

This is not the same as what you get by defining the analogous thing 
for a single answer and then combining those into sets, precisely 
because of the kind of example that Sergio used, where a bnode is 
shared:

  _:a :r _:a
  _:a :p _:b
  _:b :url <http://example.org>

and the query:
  SELECT * { ?x ?p ?y }

Sergio's possible answer for a naive entailment definition:
?x    ?p   ?y
----------------
_:b12 :r   _:b13
_:b26 :p   _:b27
_:b40 :url <http://example.org>

is now not an answer set, because it doesn't subsume this:

?x    ?p   ?y
----------------
_:b12 :r _:b12

In fact, the only 'safe' way to generate an answer set (using simple 
entailment) is to send back every matching binding you find, with the 
bnode relationships between them all intact. You don't have to use 
the same bnodeIDs, though. So this isn't an answer set either:

?x    ?p   ?y
----------------
_:b12 :r   _:b12
_:b26 :p   _:b27
_:b27 :url <http://example.org>

because it doesn't subsume

?x    ?p   ?y
----------------
_:b12 :r   _:b12
_:b12 :p   _:b27

Because the bnode relationships have to be preserved in the answer 
set, this handles Bijan's FOAF example appropriately, eg.

_:a foaf:nick "Bijan"
_:a :email "badguy@moxie.com"
_:a :spouse "Elsie Badgett"
_:a :whatever :this
_:b :whatever :notThis

SELECT * {?x foaf:nick "Bijan" .
?x ?p ?a .}

an answer set must contain all the appropriate matches with a shared 
bnode, something like

?x    ?p    ?a
----------
_:b12 foaf:nick "Bijan"
_:b12 :email "badguy@moxie.com"
_:b12 :spouse "Elsie Badgett"
_:b12 :whatever :this

Answer sets can contain redundancies, but they cannot have misleading 
weakenings of the kind shown by Sergio recently, where an IRI binding 
is just ignored and replaced by a bnode, or two occurrences of a 
single bnode are arbitrarily separated. For example, with the Bijan 
dataset and the simple query

SELECT * {?x foaf:nick ?y}

it is OK for the answer set to contain

?x    ?y
-----------
....
_:b3  _:b4
....

but its not OK for it to *not* contain

_:b3  "Bijan"

I know many folk don't like allowing redundancy in answer sets. Its 
easy to tweak the definitions to eliminate it, in fact all we need to 
say is that an answer set is required to be lean, i.e. to not be 
subsumed by a proper subset. That is, answer sets can be 'non-lean', 
just like graphs, but if they are then they always have a lean subset 
which is also an answer set. However, requiring leanness is likely to 
be very onerous for servers to check, and in many cases it seems too 
strong, because it requires servers to give lean answer sets even 
when working with non-lean graphs, and 'lean-izing' a graph is hard 
in general; and even if you have a lean graph and you generate 
answers one at a time, by careful matching, you might still get some 
redundancy in the answer set.

We can adapt the trick used by Enrico to force the bindings to use 
bnodes 'reasonably', which is to require that only bnodes in the 
graph can be used as answer bindings, and also include the graph 
itself in the consequent, thereby forcing the answer instances to not 
use graph bnodes in capriciously redundant ways. This is easy to 
state, but I worry that while this works for simple entailment, it 
may well not be appropriate for other forms of entailment, and it may 
cause problems when we want to move smoothly between for example OWL 
entailment considered as applying to OWL abstract syntax and OWL/RDF 
entailment between graphs, for example, suppose the conclusion graph 
uses RDF lists as an OWL syntax encoding, constructed using bnodes: 
do *these* bnodes have to be bnodes from the dataset? If we state 
this as a basic SPARQL condition at the RDF graph level, we have 
decided this issue, perhaps prematurely. It might be better to keep 
the basic definitions simple, as above, and keep the issue of 
redundancy-elimination slightly separate.

(Bijan, suppose we said just that servers SHOULD minimize redundancy 
as far as practicable, or SHOULD NOT introduce unnecessary or 
artificial redundancy in answer sets, could you live with that? I'm 
pretty sure that any implemented SPARQL engine will give redundant 
answers only 'by accident', rather than set out to deliberately 
sprinkle bnodes everywhere for no good reason. But read on.)

Here's how Enrico-style would look, for the record:

A binding set S is regular when (1) it is correct and (2) G entails 
S(G' ordered-merge Q), and (3) when any bnodeIDs introduced by S 
occur somewhere in G', where G' is <a 
href="http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/#section-graph-equality">equivalent</a> 
to G. (G' instead of G allows the bnodeIDs used in the answer to be 
distinct from the graphIDs, which is better if we aren't intending to 
support told bnodes, I suggest.)

Note, that application of S in (1) is of the entire binding set, so 
it uses the iterative definition from earlier, in effect generating a 
lot of 'copies' of G, one for each binding in S: but since the result 
is a set, this doesn't matter. Or, we could re-define the iterative 
construction for S(Q) but starting from G' instead of the empty 
graph, which links the two ideas together nicely. Or, we could just 
define a single answer as a B where G entails B(G order-merge Q), as 
Enrico's prose does now, and then say that S is the set of all such 
answers: but that requires a little work to show that it is an answer 
set.)

A regular answer set is a 'reasonably computably minimal' answer set. 
Its about as non-redundant as you can get while generating answers 
one at a time, i.e. without needing to check answers against all the 
other answers.

----

As Dan C noted, all of this might well be identical in operation to 
his older suggestion: my point is only to offer it as a definitional 
strategy to keep the definitions as simple and clear as possible, and 
as purely linked to entailment as possible. What I like about the 
first idea here is that its directly linked to entailment and in some 
sense 'mathematically obvious', so I think it is likely to be very 
robust; and I also think it is easy to understand. We can then treat 
the second device, of including G' and restricting to bnodes in G' 
(Sergio was right about not restricting non-bnodes), as a way to 
eliminate excess redundancy, which we could offer the world as a Good 
Way to conform to the SHOULD part of the basic definition, but not 
actually require it.

Comments? I will provide answers for any test case anyone has.

Pat

PS. BTW, in case anyone is misled by my writing style, I see all this 
as a tweak/increment to the ideas that Enrico and Sergio have 
provided, more than as a replacement. The definition of ordered-merge 
is exactly RDF-merge: I suggest the change in name only for reasons 
of exposition, not to imply spurious claims of authorship or 
originality. This is all going to go out under Eric and Andy's names 
anyway.

-- 
---------------------------------------------------------------------
IHMC		(850)434 8903 or (650)494 3973   home
40 South Alcaniz St.	(850)202 4416   office
Pensacola			(850)202 4440   fax
FL 32502			(850)291 0667    cell
phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes

Received on Monday, 9 January 2006 23:27:33 UTC