bnodes as answer bindings from Pat Hayes on 2006-08-01 (public-rdf-dawg@w3.org from July to September 2006)

From: Pat Hayes <phayes@ihmc.us>
Date: Tue, 1 Aug 2006 14:33:23 -0700
To: RDF Data Access Working Group <public-rdf-dawg@w3.org>
Cc: Bijan Parsia <bparsia@cs.man.ac.uk>, Eric Prud'hommeaux <eric@w3.org>, Enrico Franconi <franconi@inf.unibz.it>
Message-Id: <p0623090fc0f3f8c99507@[192.168.1.6]>
There have been rather strong claims made in recent emails about how 
bnodes in query patterns should (must?) be handled, in some cases 
accompanied by inappropriate condescending ad-hominem comments. I 
will say no more about the latter but focus on the former, as many of 
the technical claims being made do not seem to me to be justified on 
either semantic or pragmatic grounds; and because this issue has 
significant consequences for the design of SPARQL.

First, a brief review of the issue as I understand it.

(1) There is an obvious similarity, noted by several commentators, 
between a query pattern containing a bnode and a similar pattern with 
a variable in place of the bnode. Let me refer to these as the 'bnode 
pattern' and the 'variable pattern'. For simple (RDF/S) entailment 
there is very little to choose between them, since an instance of the 
variable pattern is found in the reference graph just when the bnode 
pattern is simply entailed by the (RDF/S closure of) the reference 
graph: so one can, in these cases, think of a bnode in a pattern as 
simply being a 'blank variable' to which a binding is not required to 
be delivered in the answer (but is of course required to exist.) And 
from there it is a short step to the idea of simply banning bnodes 
from query patterns altogether, with a concomitant simplification to 
the spec., particularly to the semantics. (<Aside> there are 
pragmatic arguments in favor of allowing them, but these have been 
disputed: but in any case, if bnodes and variables are essentially 
equivalent, this debate can be conducted entirely on pragmatic 
grounds, without getting muddled with semantic issues.</Aside>)

(2) There is an objection to this comfortable picture, which is that 
for more expressive languages (OWL-DL), it is possible for a KB to 
entail an existential statement without there being any term in the 
KB for a variable to bind to; and in such a case, the natural way to 
account for this situation would be for the variable-pattern query to 
fail, but the corresponding bnode-pattern query to succeed. This 
would therefore require that the two kinds of pattern be given 
fundamentally different semantics: intuitively, one might say that a 
bnode means 'some thing exists', but a query variable means 'some 
named thing exists', where the presence of the name is what makes the 
answer binding possible; and a bnodeID does not count as a 'name' in 
the required sense.

(3) This idea is given terminological flesh by the distinction, 
introduced in 
http://lists.w3.org/Archives/Public/public-rdf-dawg/2006AprJun/0192
between 'distinguished' and 'undistinguished' variables, and the 
observation that SPARQL currently uses 'semi-distinguished' variables 
(which can bind to both "names" and bnodeIDs). In 
http://lists.w3.org/Archives/Public/public-rdf-dawg/2006JulSep/0018
this distinction is justified on the grounds that bnodes are not 
"names", so that it would not be appropriate to supply a bnodeID as 
an answer binding to a query variable; or at any rate, as is 
well-known to the cognoscenti, this is simply not the way that things 
are done:
http://lists.w3.org/Archives/Public/public-rdf-dawg/2006JulSep/0031.

(4) None of the reasons cited in the previous paragraph seem to me to 
be convincing objections to the obvious reply to the objection in 
(2), which is that bnodeIDs *can* be given as bindings to query 
variables; and since this is so, any engine which can determine the 
entailment of a pattern with a bnode, could deliver such a binding as 
an answer to the similar query pattern with a variable in place of 
the bnode. That is, if ASK PATTERN[_:x] (with _:x scoped to the 
pattern) is entailed, then SELECT ?x PATTERN[?x] should succeed, 
possibly with ?x bound to some suitable bnodeID generated by the 
querying engine.  Adopting this as a design principle for querying 
languages in the RDF family would, I suggest, bring some overall 
regularity and simplicity to the spectrum of SPARQL versions (with 
different values for the entailment and scoping set parameters.) The 
fact that bnodes exist in the language means that purely existential 
answers can indeed be given by the same variable-binding mechanism as 
is used to deliver name bindings. In other words, there should be no 
difference between 'a thing exists such that...' and 'a named thing 
exists such that...', since the "name" can itself be a bnodeID, if 
necessary one generated for the sole purpose of being a binding for 
the query variable. Let me call this the 'existential name 
principle', as it amounts to the claim that the RDF language family 
makes no distinction between existence and named existence; since if 
something exists, the presence of bnodes means that we can, in 
effect, validly generate a name for it. In the rest of this message I 
will defend this idea against the objections raised in the above 
messages, and try to distinguish this issue from others with which it 
might be confused.

(5) One objection to this idea, cited earlier, is that it embodies a 
logical confusion: bnodes are existentially bound *variables*, and 
are therefore not logical names. Now, while it is of course true that 
bnodes and URIrefs/literals (which I will collectively call 'names' 
here) have subtly different truth conditions, they are not so 
different as this contrast suggests. First, calling one a 'name' and 
the other a 'variable' is merely a matter of surface terminology. 
Many logical syntaxes make a lexical distinction but this is not 
necessary, and many others, such as ISO Common Logic, do not: a 
'variable' then is simply a name that is bound by a quantifier. 
Second, more to the point, they have essentially the same semantics: 
they both in effect assert that something exists, and use a lexical 
token to refer to it. (It is worth noting that there are logical 
syntactic conventions which do not use lexical tokens *of any kind* 
to show existence, such as Peirce's existential graphs.) The only 
difference is that the scope of a name token might extend beyond the 
local syntactic context, allowing other assertions to be made about 
the thing denoted by the name; whereas an existential 'variable' is a 
lexical token only within the context defined by the scope of the 
quantifier. In natural deduction logical systems, logical names and 
existentials are almost interconvertible, using the rules EI and EG:

PHI[name]
----------------EG
(exists (x)PHI[x])

(exists (x)PHI[x])
--------------------EI
PHI[name*]
when name* does not occur free in the derivation, ie is a 'new' name.

The need to invoke a 'new' name in EI illustrates the point about 
scoping: the 'newness' is a guarantee that no larger scope can 
'capture' this name. But once this matter is dealt with, there really 
is no logical difference at all between a name and an existentially 
bound 'variable'. (Another way to illustrate this is to consider 
alternative textbook variations of logic which dispense with names 
altogether, and only have quantified variables. Logical 'names' then 
are simply existential variables which are bound at the top lexical 
level; a typical database is considered to be a huge existential 
sentence.)

Bnodes are in fact even more similar to RDF names, i.e. to 
URIreferences, than is usual in a logical notation; since bnodes, 
like URIrefs, have global scope. This might seem an odd claim, but it 
is both true and important. The basic RDF graph syntax model, on 
which the RDF semantics is defined, has *no notion at all* of 
syntactic scope. The existential interpretation of bnodes is given 
relative to an RDF graph, but an RDF graph is simply any set of RDF 
triples. The set of all RDF triples on the Web is an RDF graph, as is 
the set of all syntactically possible RDF triples, as is any 
singleton set containing a single RDF triple, or anything in between. 
There in nothing in the abstract graph syntax which allows one to 
indicate the edge or extent of an RDF graph: no bnode binders or 
quantifiers or scoping syntax devices. All scoping of bnode IDs - not 
bnodes themselves, but their identifiers in a concrete interchange 
syntax such as SPARQL or RDF/XML - is presumed to be done by external 
conventions arising from the lexical/syntactic/scoping rules of 
whatever surface syntax form is being used to convey the abstract 
syntax, and it is these external conventions which determine which 
sets of triples are being considered as 'graphs' in any particular 
RDF application. This was a deliberate design decision, but perhaps 
one that was not adequately explained. What it amounts to is a 
decision to trade issues of lexical syntax (name scope, name binders, 
the free/bound distinction, etc.) for a purely mathematical 'abstract 
syntax' which treats all items uniformly. Thus, the only meaningful 
question about bnodes in RDF graphs is whether they are the same or 
not. If two graphs contain the same bnodes, then the implied 'scope' 
of that bnode is the union of the graphs. Thus, questions of lexical 
or syntactic scope which usually arise when talking about existential 
'variables' in a conventional syntax are replaced, in the RDF graph 
syntax, by simple questions of identity between bnodes, allowing the 
entire graph syntax to be described using elementary set theory. The 
resulting picture of the RDF graph syntax has both URIrefs and bnodes 
being treated almost identically: both have global scope, both are 
referring terms, and their role in the semantics is almost identical; 
but bnodesIDs are usually treated as having document scope, while 
URIrefs are lexical items with global scope in all Web applications.

(6) Another objection seems to be that to allow variables to bind to 
bnodes somehow introduces great additional complexity to the 
query-answering process. It seems to me that this must be wrong, 
because one can define the variable-binding option in terms of the 
other. Suppose that a query engine is capable of generating answers 
to ASK queries containing bnodes, then it can deliver appropriate 
answers to SELECT queries by the following pseudocode, where 
(gensymBnode) generates a new bnodeID and P[a/b] means the pattern P 
with a replaced by b:

if (  ASK P[(gensymBnode)/?x] ) then return (bind ?x (gensymBnode))

Of course, this needs to be written more carefully to account for 
multiple variables, but you get the idea. And as written this does 
require doing an ASK for each pattern of variables in the query; but 
this is at worst a small multiple linear extra cost, and need only be 
done in cases where a real SELECT query on the variable has failed; 
and I am sure a more efficient method can be defined in practice, 
since the derivation of the inferences supporting the existential 
query will in any case involve generating suitable 'names'. (I can 
kind of see that in the presence of elaborate equality reasoning, 
problems might arise in determining whether or not multiple bindings 
are really the same. But even here, the possibility of providing a 
mere existential name - a bnode - does seem to provide an opportunity 
for computational savings with a possible loss of completeness but 
without sacrificing correctness.)

(7) Finally, there seems to be a misapprehension that because any 
graph entails infinitely many other bnode-rich graphs which have it 
as an instance, that to allow bnodes as bindings will somehow open an 
infinite floodgate of redundant answers. And this would be true, if 
the spec were written in a very naive way so that entailment alone 
were a sufficient criterion for correctness of an answer binding. But 
this, of course, is why the SPARQL specs are written in the way that 
they are, to restrict answer bindings to a particular scoping set of 
possible bindings - in the case of normative SPARQL, the URIs and 
Bnodes which actually occur in the scoping graph. The separation 
between the scoping graph and the scoping set was written into the 
spec to allow for alternative strategies for handling OWL. Bijan 
notes that the spec as written therefore allows 'OWL-DL-SPARQL' - 
which, to emphasize, is not itself defined by the spec - to restrict 
itself to bindiings to an arbitrarily small set of legal answer 
bindings, and therefore to restrict itself so as to disallow bindings 
to bnodes. This is technically correct, but wholly against the design 
intention of those definitions. It was never the intent of that 
construction to restrict the set of legal answer bindings: it was, 
rather, to allow the scoping set to contain *extra* bnodeIDs, 
precisely in order to overcome the objection raised in the second 
paragraph above, i.e. to allow the query engine to invent new bnodes 
when they are required in order to provide an answer binding to a 
query which has a purely existential answer, for which no bnode which 
actually occurs in the scoping graph is an appropriate answer binding.

(8) Although the restricted interpretation (where a scoping set for 
OWL-DL queries must contain no bnodes) does not violate any actual 
technical requirement of the current SPARQL draft, I would argue 
strongly against adopting it as a standard, to the point that I would 
argue for pre-emptively rendering it illegal by suitable wording to 
the current spec. With the restricted interpretation, for example, we 
would have the absurd situation that answer bindings could be derived 
from a graph using RDF entailment which are not legal using OWL-DL 
entailment, even when the graph itself is legal OWL-DL/RDF.

(9) A final quick word on the claim of prior art implicit in the 
wording of some of the earlier messages. The entire subject of RDF 
querying is too new for there to be a substantial body of accepted 
techniques, accepted ideas, etc.., ; so any appeal, explicit or 
implicit, to the 'way things are done', or to a body of expert 
knowledge not vouchsafed to us all but weighty in its implications, 
should be treated as a symptom of blowhardiness; or at the very least 
in order to be taken seriously needs to be backed up by actual 
citations and/or arguments.

Comments, ripostes, objections, etc. welcome.

Pat Hayes

-- 
---------------------------------------------------------------------
IHMC		(850)434 8903 or (650)494 3973   home
40 South Alcaniz St.	(850)202 4416   office
Pensacola			(850)202 4440   fax
FL 32502			(850)291 0667    cell
phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes
Received on Tuesday, 1 August 2006 21:33:37 UTC