bnodeification (was: Re: SPARQL semantics: open issues for basic query patterns) from Pat Hayes on 2006-01-09 (public-rdf-dawg@w3.org from January to March 2006)

From: Pat Hayes <phayes@ihmc.us>
Date: Mon, 9 Jan 2006 17:27:19 -0600
To: RDF Data Access Working Group <public-rdf-dawg@w3.org>
Cc: bparsia@isr.umd.edu, Enrico Franconi <franconi@inf.unibz.it>, Sergio Tessaris <tessaris@inf.unibz.it>
Message-Id: <p06230901bfe4537401a4@[10.100.0.7]>

I'm separating this into two threads, (this) one concerned with
tweaking the definitions to handle bnodes properly, the other to do
with wording for an 'entailment' parameter, based on Enrico's draft.
I'm pretty sure these issues are orthogonal, provided that the
definitions refer centrally to 'entails', which can either be the
parameter or can be understood as 'simply entails' if we decide not
to have an entailment parameter. So I will say 'entails' here without
commitment one way or the other.

So, I learned last week that bnodes in answers are scoped to the
answer SET, in effect, rather than to individual answers: to the
table rather than to lines in the table. Sorry it took me so long to
grok that. This is great, because this, plus our decision to not have
'told bnodes', means that we can tie the definitions even closer to
entailment. The trick for handling told bnodes was to put (a copy of)
the graph into the consequent along with the answer instance, forcing
them to 'share' bnodes properly. The trick for making the bnodes in
the answers behave properly is to put ALL the answers instances into
the consequent, which works similarly. So, all the definitions apply
to sets of answer bindings rather than individual bindings, call this
a binding set. We need to make one rather awkward definition, and its
awkwardness arises from the fact that we allow bnodes in query
patterns and we want those bnodes to be scoped to the query pattern,
whereas we want the bnodes in the answer bindings to be scoped to the
entire answer set. Once that is done, the other definitions are as
clear as day.

Define the ordered merge of an RDF graph G and a SPARQL graph pattern
P to be a pattern containing all the triples in G and M(P), where M
is a 1:1 mapping on bnodes which replaces any bnode in P which also
occurs in G by a new bnode not in G. If there are no such 'shared'
bnodes then M is the identity mapping and the merge is simply the
union of all the triples in G and P.
(This is Enrico's "RDF-merge", but uses the fact that is defined on
patterns to be an opportunity to be technically a new definition. I
think its slightly misleading to call it RDF-merge since technically
its not quite the same definition as RDF merge.)

This definition should be understood to apply even when P contains no
variables: in that case, this definition is essentially <a
href="http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/#section-graph-equality
">equivalent</a> to that of RDF graph <a
href="http://www.w3.org/TR/rdf-mt/#graphdefs">merge</a>.

Given a binding set S for a query Q, the answer graph S(Q) is,
intuitively, the RDF graph formed by merging together one instance of
the query Q under each binding B in S, while maintaining the identity
of any bnodes which are introduced by the bindings themselves. More
precisely, consider the result AN of iterating the
following process over the bindings B1 ... BN
in S, starting with A0 as the empty graph:

An := Bn(An-1 ordered-merge Q)

and define S(Q) to be AN.

It is easy to see that the choice of ordering of S will affect only
the particular bnodeIDs which occur in the final graph, so that the
particular ordering can be chosen arbitrarily.

Since all the An are RDF graphs which contain no
variables, applying the binding mapping after the merge has the
result required, of treating bnodes in the query pattern as scoped to
the pattern; but all bnodes in answer bindings share a single scope,
since once introduced into the answer graph they retain their
identity through subsequent ordered merges.

(If the query contains no bnodes, then there is a much simpler way to
define S(Q), as simply being the union (not merge) of all the query
instances, i.e. the RDF graph {B(T): B in S and T in Q}. Another way
to define S(Q) is to imagine that every bnode in Q is replaced by a
distinct query variable (which cannot be SELECTed, however) and then
use this definition. There are cases in which this definition is
slightly tighter than the previous one, eliminating some redundancy.
Also this doesn't need the ordered-merge idea.)

OK, now we have that over and done with, it all gets much easier:

A binding set S is correct when G entails S(Q)

A binding set S subsumes (we could say 'entails', but it might be
confusing(?)) another binding set T when S(Q) entails T(Q).

The full answer set is the maximal possible correct set, i.e. all
possible entailed instances. (This always exists and is unique, but
it is infinite: it's purely a theoretical construct.)

An answer set is any correct set which subsumes the full answer set.

An answer set is redundant if it is subsumed by a proper subset of
itself; otherwise it is lean.

OK, that is all. SPARQL servers MUST give only correct binding sets.
If the number of answers is not restricted in the query then they
MUST give an answer set. They SHOULD NOT introduce redundancies in
answer sets. This allows redundant answers: see below for more on
this.

This is not the same as what you get by defining the analogous thing
for a single answer and then combining those into sets, precisely
because of the kind of example that Sergio used, where a bnode is
shared:

_:a :r _:a
_:a :p _:b
_:b :url <http://example.org>

and the query:
SELECT * { ?x ?p ?y }

Sergio's possible answer for a naive entailment definition:
?x ?p ?y
----------------
_:b12 :r _:b13
_:b26 :p _:b27
_:b40 :url <http://example.org>

is now not an answer set, because it doesn't subsume this:

?x ?p ?y
----------------
_:b12 :r _:b12

In fact, the only 'safe' way to generate an answer set (using simple
entailment) is to send back every matching binding you find, with the
bnode relationships between them all intact. You don't have to use
the same bnodeIDs, though. So this isn't an answer set either:

?x ?p ?y
----------------
_:b12 :r _:b12
_:b26 :p _:b27
_:b27 :url <http://example.org>

because it doesn't subsume

?x ?p ?y
----------------
_:b12 :r _:b12
_:b12 :p _:b27

Because the bnode relationships have to be preserved in the answer
set, this handles Bijan's FOAF example appropriately, eg.

_:a foaf:nick "Bijan"
_:a :email "badguy@moxie.com"
_:a :spouse "Elsie Badgett"
_:a :whatever :this
_:b :whatever :notThis

SELECT * {?x foaf:nick "Bijan" .
?x ?p ?a .}

an answer set must contain all the appropriate matches with a shared
bnode, something like

?x ?p ?a
----------
_:b12 foaf:nick "Bijan"
_:b12 :email "badguy@moxie.com"
_:b12 :spouse "Elsie Badgett"
_:b12 :whatever :this

Answer sets can contain redundancies, but they cannot have misleading
weakenings of the kind shown by Sergio recently, where an IRI binding
is just ignored and replaced by a bnode, or two occurrences of a
single bnode are arbitrarily separated. For example, with the Bijan
dataset and the simple query

SELECT * {?x foaf:nick ?y}

it is OK for the answer set to contain

?x ?y
-----------
....
_:b3 _:b4
....

but its not OK for it to *not* contain

_:b3 "Bijan"

I know many folk don't like allowing redundancy in answer sets. Its
easy to tweak the definitions to eliminate it, in fact all we need to
say is that an answer set is required to be lean, i.e. to not be
subsumed by a proper subset. That is, answer sets can be 'non-lean',
just like graphs, but if they are then they always have a lean subset
which is also an answer set. However, requiring leanness is likely to
be very onerous for servers to check, and in many cases it seems too
strong, because it requires servers to give lean answer sets even
when working with non-lean graphs, and 'lean-izing' a graph is hard
in general; and even if you have a lean graph and you generate
answers one at a time, by careful matching, you might still get some
redundancy in the answer set.

We can adapt the trick used by Enrico to force the bindings to use
bnodes 'reasonably', which is to require that only bnodes in the
graph can be used as answer bindings, and also include the graph
itself in the consequent, thereby forcing the answer instances to not
use graph bnodes in capriciously redundant ways. This is easy to
state, but I worry that while this works for simple entailment, it
may well not be appropriate for other forms of entailment, and it may
cause problems when we want to move smoothly between for example OWL
entailment considered as applying to OWL abstract syntax and OWL/RDF
entailment between graphs, for example, suppose the conclusion graph
uses RDF lists as an OWL syntax encoding, constructed using bnodes:
do *these* bnodes have to be bnodes from the dataset? If we state
this as a basic SPARQL condition at the RDF graph level, we have
decided this issue, perhaps prematurely. It might be better to keep
the basic definitions simple, as above, and keep the issue of
redundancy-elimination slightly separate.

(Bijan, suppose we said just that servers SHOULD minimize redundancy
as far as practicable, or SHOULD NOT introduce unnecessary or
artificial redundancy in answer sets, could you live with that? I'm
pretty sure that any implemented SPARQL engine will give redundant
answers only 'by accident', rather than set out to deliberately
sprinkle bnodes everywhere for no good reason. But read on.)

Here's how Enrico-style would look, for the record:

A binding set S is regular when (1) it is correct and (2) G entails
S(G' ordered-merge Q), and (3) when any bnodeIDs introduced by S
occur somewhere in G', where G' is <a
href="http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/#section-graph-equality">equivalent</a>
to G. (G' instead of G allows the bnodeIDs used in the answer to be
distinct from the graphIDs, which is better if we aren't intending to
support told bnodes, I suggest.)

Note, that application of S in (1) is of the entire binding set, so
it uses the iterative definition from earlier, in effect generating a
lot of 'copies' of G, one for each binding in S: but since the result
is a set, this doesn't matter. Or, we could re-define the iterative
construction for S(Q) but starting from G' instead of the empty
graph, which links the two ideas together nicely. Or, we could just
define a single answer as a B where G entails B(G order-merge Q), as
Enrico's prose does now, and then say that S is the set of all such
answers: but that requires a little work to show that it is an answer
set.)

A regular answer set is a 'reasonably computably minimal' answer set.
Its about as non-redundant as you can get while generating answers
one at a time, i.e. without needing to check answers against all the
other answers.

----

As Dan C noted, all of this might well be identical in operation to
his older suggestion: my point is only to offer it as a definitional
strategy to keep the definitions as simple and clear as possible, and
as purely linked to entailment as possible. What I like about the
first idea here is that its directly linked to entailment and in some
sense 'mathematically obvious', so I think it is likely to be very
robust; and I also think it is easy to understand. We can then treat
the second device, of including G' and restricting to bnodes in G'
(Sergio was right about not restricting non-bnodes), as a way to
eliminate excess redundancy, which we could offer the world as a Good
Way to conform to the SHOULD part of the basic definition, but not
actually require it.

Comments? I will provide answers for any test case anyone has.

Pat

PS. BTW, in case anyone is misled by my writing style, I see all this
as a tweak/increment to the ideas that Enrico and Sergio have
provided, more than as a replacement. The definition of ordered-merge
is exactly RDF-merge: I suggest the change in name only for reasons
of exposition, not to imply spurious claims of authorship or
originality. This is all going to go out under Eric and Andy's names
anyway.

--
---------------------------------------------------------------------
IHMC (850)434 8903 or (650)494 3973 home
40 South Alcaniz St. (850)202 4416 office
Pensacola (850)202 4440 fax
FL 32502 (850)291 0667 cell
phayesAT-SIGNihmc.us http://www.ihmc.us/users/phayes

Received on Monday, 9 January 2006 23:27:33 UTC