Bnodes and the RDF graph syntax (part 1). from Pat Hayes on 2006-08-20 (public-rdf-dawg@w3.org from July to September 2006)

From: Pat Hayes <phayes@ihmc.us>
Date: Sun, 20 Aug 2006 12:46:33 -0700
To: RDF Data Access Working Group <public-rdf-dawg@w3.org>
Message-Id: <p06230961c10da323264b@[192.168.1.6]>
I will try here to explain why 'told bnodes' aren't inconsistent with 
the RDF semantics (and RDF graph syntax), and why the RDF specs lead 
naturally to a view of bnodes as 'anonymous identifiers' (although I 
don't like that terminology) and why this isn't inconsistent with the 
RDF specs or the basic RDF model. This will also, hopefully, explain 
what I have been talking about in some of my exchanges with Bijan 
regarding existentials, names, quantifiers, bnodes having global 
scope and so on. This will take a while and require some discursive 
background, so please bear with me.

First, it is important to distinguish between bnodes and bnodeIDs. In 
the RDF abstract graph syntax, there are no bnode IDs: bnodes are, 
literally, blank. Think of an RDF graph as a directed node-arc-node 
network with URIrefs and literals at the nodes and URIrefs on the 
arcs; then the blank nodes are, literally, blank nodes; nodes in this 
network that have *no label at all*. Those diagrams *are* the 
abstract syntax of RDF: but in order to lexicalize them in a 
character string, it is necessary to introduce some way to indicate 
when you are referring to the same blank node, so a document which 
lexicalizes an RDF graph has to use bnodeIDs (or something similar) 
to indicate this. But these IDs are not the bnodes themselves: in 
fact, they can be naturally thought of as *denoting* the bnodes, so 
that a lexicalized form for RDF (such as RDF/XML or Ntriples) is in 
fact in a sense a meta-syntactic description of some RDF graph 
structure.

The reason this is more than just a curiosity, is that it means that 
bnodes are not lexical tokens. Since they have no lexical form, the 
type/token distinction is meaningless for bnodes. Bnodes are globally 
unique, and there are no relationships between them (other than 
identity). There cannot be two occurrences of the 'same' blank node 
in two disconnected graphs, as there can be two different occurrences 
(tokens) of the same letter (type) in two separate documents. This is 
an important way in which RDF graph syntax differs from conventional 
lexical syntax. In a conventional logical syntax we might have two 
expressions both using the same name:
(P x)
(Q x x)
and then binding expressions are used to indicate which of these 
occurrences are considered to be 'the same variable' as others, which 
we conventionally refer to as being in the same scope, so that
(exists (x)( and (P x)(Q x x)))
makes all the tokens of 'x' into instances of the same variable, while
(exists (x)(P x))
(exists (x)(Q x x))
makes different tokens into instances of distinct variables (which 
happen to have the same lexical form, but this is no longer of any 
logical importance.) All grade-school stuff, I know, but my point is 
that none of this makes sense for bnodes. That second expression 
would require *two distinct* bnodes to be encoded in RDF graph 
syntax. You cannot re-use the same bnode in two different graphs. 
Think about it: if you draw a node-arc diagram in one place and I 
draw a different diagram in another place, and they both have blank 
nodes in them, how can those possibly be the *same* blank node? 
Unless, that is, we are both using the same diagram, with one node in 
it, and both linking our arcs to it: but then if you are writing the 
RDF equivalent of (P x) - which might look like

[blank node] ---rdf:type--->:P

and I am writing the equivalent of (Q x x), which might look like 
this, excuse the ascii-art:

[blank node] --- :Q ---
     /\                 |
     |__________________|

and if those really are the *same* bnode,  then we have, together, 
written the equivalent of the first existential, with both atoms in 
the same binder scope: an RDF graph with a total of two (not three) 
nodes and two arcs.

Now, the RDF concepts document describes the graph syntax using 
mathematical language rather than talking about these pictures, but 
the pictures are very much the driving intuition. (Hence the 
terminology of RDF "graph" and blank "node", although these are not, 
technically, graphs in the graph theory sense.) The mathematics says 
simply that bnodes are an infinite set of 'entities' which are 
different from URIrefs and literals. (Think of the set of anonymous 
blobs in all the possible node-arc-node diagrams that there ever 
could be. This set is global, to emphasize again: don't think of 
bnodes as being like words built up from a finite vocabulary, say, or 
as like anything constructed. They just *exist*.) Triples are 
3-tuples of bnodes, literals and URIrefs, and then an RDF graph is a 
set of triples, yadda yadda, you know the rest. Any such set can 
obviously be written as a node-arc diagram (with suitable noises 
about infinite cases) and any such diagram defines a set of triples. 
But, notice that the spec does not say that *every* set of triples is 
an RDF graph. In fact, it doesn't actually say anywhere what counts 
as 'an RDF graph': it doesn't say where the 'boundary' of an RDF 
graph is supposed to be, and it provides no way to indicate where 
such boundaries lie. This is what I meant when I said to Bijan that 
RDF has no quantifiers. There is nothing in the RDF graph syntax 
which can be used to specify the edge or boundary of an RDF graph; no 
scoping conventions, as it were: no quantifiers, no brackets or 
parentheses. What makes a set of triples into an RDF graph is left 
implicit or unstated in the RDF specs; and this is a deliberate 
omission, because we explicitly did not want to create any such 
boundaries. Part of the very idea of RDF, the rationale for inventing 
it in the first place, was the idea of being able to gather together 
scraps of information from a variety of sources and use it together 
in one place. At some point in this process, you would decide that 
you had 'an RDF graph', and do something with it; but this RDF graph 
might not correspond to any document boundary or indeed any 
syntactically marked boundary at all: it might be defined by any 
number of criteria, and it might be highly temporary, or evanescent, 
or virtual.

But, one might object, if RDF doesn't define where the boundaries of 
an RDF graph are, and yet the notion of 'RDF graph' is so central to 
the semantics, isn't the whole thing seriously underspecified? But in 
fact, the underspecificity is only apparent. To explain this I need 
to take a brief diversion into elementary logic, using conventional 
quantifier/bound-variable language. What are quantifiers for, after 
all? In full FO logic, they can be nested in all sorts of ways, but 
in RDF there are no universal quantifiers (and in RDFS, the 
'implicit' universal quantifiers represented by such things as 
rdfs:subClass are all inside the scope of all the existentials, i.e. 
it is an EA logic) and so we know that any bnode is simply 
existentially quantified at the top level (not inside any other 
quantifiers); and under these circumstances, the only point of having 
quantifiers would be to indicate the scope of the particular 
variable, to be able to distinguish things like the two cases in the 
above example. But, note, this need arises *only if the same variable 
name is used in two different scopes*. Consider for a moment a 
variation on 'normal' logical notation in which every quantifier 
binds a distinct bound variable, unique to the quantifier, so that 
all bound variable names are 'standardized apart' simply as a matter 
of well-formedness of the syntax, and every variable in the text is 
bound by a unique quantifier. (This would be totally impractical with 
a conventional notation, of course, since there would be no way to 
know how to accidentally avoid re-using a name someone else had used 
without you knowing about it, not to mention that we would quickly 
run out of plausible variable names. But just imagine it, anyway. It 
is very similar to some ancient logical notations which use lines or 
circles rather than variable-name correspondences to indicate 
connections between quantifiers and what they bind.) Under this 
unique-variable-name condition, all quantifiers can be moved outwards 
without changing the meaning of the expression. That is, if x does 
not occur in Q and y does not occur in P, then
(exist (x) P) & (exists (y) Q)
is equivalent to
(exists (x y)(P & Q))
and as long as we can be confident that no accidental 'capturing' of 
a name by a quantifier can occur, we can go on moving these 
quantifiers outwards as far as we like. In fact we can imagine them 
moved so far 'out' that their scope is the entire Web, at which point 
they become unnecessary, since all the quantifier scopes are the 
same, and all are global. Of course, they still conform to the 
existential semantics, so are not really *identifiers*, and this only 
works if we have some syntactic way to distinguish them from other 
names; but of course we do, in RDF.

Since bnodes do have this syntactic uniqueness, by design, they can 
be thought of in this way. And this is exactly the way that the RDF 
model was intended to be used. Which is why there are no quantifiers 
in RDF. What it amounts to is that there is no notion of 'lexical 
scope' in RDF; the task (and it is essentially a syntactic task) that 
scoping serves in a conventional notation has been replaced, in RDF 
graph syntax, by the idea of bnode *identity*. If two chunks of RDF 
have no bnodes in common, we can think of them as separate graphs 
without changing the meaning of anything. In this case, the graph 
formed by their union is equivalent to the conjunction of the two 
graphs considered in isolation, so the graph boundaries are of little 
importance, and do not affect any meanings: we can be careless about 
where exactly the graph boundaries are. If they share a bnode, then 
they are conceptually parts of a larger graph, their union. You *can* 
think of a subgraph of a larger graph as being a graph in its own 
right, but then the appropriate thing to do is to change the bnodes 
in it which are shared with other parts of the graph, so as to keep 
the bnode populations of the two distinct graphs separate: this leads 
to the idea a merge rather than a union. There is no significance to 
the particular choice of the *actual* bnodes, of course, since they 
are blank; hence the convention that graph-equivalent graphs are 
treated as effectively interchangeable. In effect, bnodes are like 
interchangeable tokens with no particular significance in themselves, 
other than being different from all the others. Again, think of blobs 
in diagrams.

If two sets of RDF triples do not share a bnode, then one can think 
of those bnodes as being like bound variables bound by quantifiers 
just at the edge of the graphs. (Imagine that each bnode 'calls down' 
its unique binding quantifier from the edge of the entire Web, to 
surround this particular graph. That scope change doesn't affect any 
meanings, since that quantifier doesn't bind anything else on the 
entire Web, because bnodes are globally unique.) This allows the RDF 
graph syntax to be treated very similarly to a conventional syntax, 
almost all the time, and why one can often manage quite well without 
bothering about the distinction between bnodes and bnodeIDs. One can 
'think lexically', treating the bnodeID scopes as though they were 
bnode scopes. But in fact, this is rather sloppy, and only works as 
long as the ID scopes - which are determined by document conventions 
rather than by the RDF specs - are naturally aligned with 
co-occurrences of bnodes in the described graphs. They will be, as 
long as there is no way for two distinct documents to describe graphs 
with a bnode in common: and before SPARQL, there was no such way. The 
only way to *refer to* a bnode at all is to use a bnodeID, and 
bnodeID scopes are document-local; and although AFAIK this was 
nowhere stated explicitly, there is a universally understood 
assumption that distinctly described graphs do not 'accidentally' 
share bnodes, e.g. if I write two Ntriples documents then the RDF 
graphs they describe do not share a bnode. Under these circumstances, 
then, document scoping can be thought of as working exactly like a 
conventional logical notation in which RDF syntax *is* the document 
syntax, the bnodeIDs are existential variables bound by implicit 
quantifiers at the edge of documents, and bnodes are just a kind of 
irrelevant ghost of these bound variables in the RDF abstract graph 
space. But this is not in fact the picture described by the RDF 
specs, which say that the graph syntax is normative: and this 
simplified surface-syntax picture does not work for SPARQL, because 
SPARQL, for the first time, treats bnodes in ways that cannot be 
identified so straightforwardly with syntactic document scopes.

Now, to emphasize an obvious point: all this is about *syntax*, not 
semantics. By saying that a bnode has global scope, Im not saying 
that it doesn't have existential semantics: it does. The RDF specs 
are quite clear on this point, and nothing in the above denies 
anything in the specs. The only way that RDF is, er, weird, is in its 
unconventional graph syntax, which trades lexical issues of scope and 
the free/bound distinction for a simpler mathematical model of syntax 
which can be defined purely set-theoretically and uses simple bnode 
identity as its sole syntactic tool.

[More to come on RDF inference rules, leanness, told bnodes and 
E-entailment; but I'll send this now to avoid further delay.]

Pat


-- 
---------------------------------------------------------------------
IHMC		(850)434 8903 or (650)494 3973   home
40 South Alcaniz St.	(850)202 4416   office
Pensacola			(850)202 4440   fax
FL 32502			(850)291 0667    cell
phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes
Received on Sunday, 20 August 2006 19:46:59 UTC