- From: Pat Hayes <phayes@ihmc.us>
- Date: Sun, 20 Aug 2006 12:46:33 -0700
- To: RDF Data Access Working Group <public-rdf-dawg@w3.org>
I will try here to explain why 'told bnodes' aren't inconsistent with
the RDF semantics (and RDF graph syntax), and why the RDF specs lead
naturally to a view of bnodes as 'anonymous identifiers' (although I
don't like that terminology) and why this isn't inconsistent with the
RDF specs or the basic RDF model. This will also, hopefully, explain
what I have been talking about in some of my exchanges with Bijan
regarding existentials, names, quantifiers, bnodes having global
scope and so on. This will take a while and require some discursive
background, so please bear with me.
First, it is important to distinguish between bnodes and bnodeIDs. In
the RDF abstract graph syntax, there are no bnode IDs: bnodes are,
literally, blank. Think of an RDF graph as a directed node-arc-node
network with URIrefs and literals at the nodes and URIrefs on the
arcs; then the blank nodes are, literally, blank nodes; nodes in this
network that have *no label at all*. Those diagrams *are* the
abstract syntax of RDF: but in order to lexicalize them in a
character string, it is necessary to introduce some way to indicate
when you are referring to the same blank node, so a document which
lexicalizes an RDF graph has to use bnodeIDs (or something similar)
to indicate this. But these IDs are not the bnodes themselves: in
fact, they can be naturally thought of as *denoting* the bnodes, so
that a lexicalized form for RDF (such as RDF/XML or Ntriples) is in
fact in a sense a meta-syntactic description of some RDF graph
structure.
The reason this is more than just a curiosity, is that it means that
bnodes are not lexical tokens. Since they have no lexical form, the
type/token distinction is meaningless for bnodes. Bnodes are globally
unique, and there are no relationships between them (other than
identity). There cannot be two occurrences of the 'same' blank node
in two disconnected graphs, as there can be two different occurrences
(tokens) of the same letter (type) in two separate documents. This is
an important way in which RDF graph syntax differs from conventional
lexical syntax. In a conventional logical syntax we might have two
expressions both using the same name:
(P x)
(Q x x)
and then binding expressions are used to indicate which of these
occurrences are considered to be 'the same variable' as others, which
we conventionally refer to as being in the same scope, so that
(exists (x)( and (P x)(Q x x)))
makes all the tokens of 'x' into instances of the same variable, while
(exists (x)(P x))
(exists (x)(Q x x))
makes different tokens into instances of distinct variables (which
happen to have the same lexical form, but this is no longer of any
logical importance.) All grade-school stuff, I know, but my point is
that none of this makes sense for bnodes. That second expression
would require *two distinct* bnodes to be encoded in RDF graph
syntax. You cannot re-use the same bnode in two different graphs.
Think about it: if you draw a node-arc diagram in one place and I
draw a different diagram in another place, and they both have blank
nodes in them, how can those possibly be the *same* blank node?
Unless, that is, we are both using the same diagram, with one node in
it, and both linking our arcs to it: but then if you are writing the
RDF equivalent of (P x) - which might look like
[blank node] ---rdf:type--->:P
and I am writing the equivalent of (Q x x), which might look like
this, excuse the ascii-art:
[blank node] --- :Q ---
/\ |
|__________________|
and if those really are the *same* bnode, then we have, together,
written the equivalent of the first existential, with both atoms in
the same binder scope: an RDF graph with a total of two (not three)
nodes and two arcs.
Now, the RDF concepts document describes the graph syntax using
mathematical language rather than talking about these pictures, but
the pictures are very much the driving intuition. (Hence the
terminology of RDF "graph" and blank "node", although these are not,
technically, graphs in the graph theory sense.) The mathematics says
simply that bnodes are an infinite set of 'entities' which are
different from URIrefs and literals. (Think of the set of anonymous
blobs in all the possible node-arc-node diagrams that there ever
could be. This set is global, to emphasize again: don't think of
bnodes as being like words built up from a finite vocabulary, say, or
as like anything constructed. They just *exist*.) Triples are
3-tuples of bnodes, literals and URIrefs, and then an RDF graph is a
set of triples, yadda yadda, you know the rest. Any such set can
obviously be written as a node-arc diagram (with suitable noises
about infinite cases) and any such diagram defines a set of triples.
But, notice that the spec does not say that *every* set of triples is
an RDF graph. In fact, it doesn't actually say anywhere what counts
as 'an RDF graph': it doesn't say where the 'boundary' of an RDF
graph is supposed to be, and it provides no way to indicate where
such boundaries lie. This is what I meant when I said to Bijan that
RDF has no quantifiers. There is nothing in the RDF graph syntax
which can be used to specify the edge or boundary of an RDF graph; no
scoping conventions, as it were: no quantifiers, no brackets or
parentheses. What makes a set of triples into an RDF graph is left
implicit or unstated in the RDF specs; and this is a deliberate
omission, because we explicitly did not want to create any such
boundaries. Part of the very idea of RDF, the rationale for inventing
it in the first place, was the idea of being able to gather together
scraps of information from a variety of sources and use it together
in one place. At some point in this process, you would decide that
you had 'an RDF graph', and do something with it; but this RDF graph
might not correspond to any document boundary or indeed any
syntactically marked boundary at all: it might be defined by any
number of criteria, and it might be highly temporary, or evanescent,
or virtual.
But, one might object, if RDF doesn't define where the boundaries of
an RDF graph are, and yet the notion of 'RDF graph' is so central to
the semantics, isn't the whole thing seriously underspecified? But in
fact, the underspecificity is only apparent. To explain this I need
to take a brief diversion into elementary logic, using conventional
quantifier/bound-variable language. What are quantifiers for, after
all? In full FO logic, they can be nested in all sorts of ways, but
in RDF there are no universal quantifiers (and in RDFS, the
'implicit' universal quantifiers represented by such things as
rdfs:subClass are all inside the scope of all the existentials, i.e.
it is an EA logic) and so we know that any bnode is simply
existentially quantified at the top level (not inside any other
quantifiers); and under these circumstances, the only point of having
quantifiers would be to indicate the scope of the particular
variable, to be able to distinguish things like the two cases in the
above example. But, note, this need arises *only if the same variable
name is used in two different scopes*. Consider for a moment a
variation on 'normal' logical notation in which every quantifier
binds a distinct bound variable, unique to the quantifier, so that
all bound variable names are 'standardized apart' simply as a matter
of well-formedness of the syntax, and every variable in the text is
bound by a unique quantifier. (This would be totally impractical with
a conventional notation, of course, since there would be no way to
know how to accidentally avoid re-using a name someone else had used
without you knowing about it, not to mention that we would quickly
run out of plausible variable names. But just imagine it, anyway. It
is very similar to some ancient logical notations which use lines or
circles rather than variable-name correspondences to indicate
connections between quantifiers and what they bind.) Under this
unique-variable-name condition, all quantifiers can be moved outwards
without changing the meaning of the expression. That is, if x does
not occur in Q and y does not occur in P, then
(exist (x) P) & (exists (y) Q)
is equivalent to
(exists (x y)(P & Q))
and as long as we can be confident that no accidental 'capturing' of
a name by a quantifier can occur, we can go on moving these
quantifiers outwards as far as we like. In fact we can imagine them
moved so far 'out' that their scope is the entire Web, at which point
they become unnecessary, since all the quantifier scopes are the
same, and all are global. Of course, they still conform to the
existential semantics, so are not really *identifiers*, and this only
works if we have some syntactic way to distinguish them from other
names; but of course we do, in RDF.
Since bnodes do have this syntactic uniqueness, by design, they can
be thought of in this way. And this is exactly the way that the RDF
model was intended to be used. Which is why there are no quantifiers
in RDF. What it amounts to is that there is no notion of 'lexical
scope' in RDF; the task (and it is essentially a syntactic task) that
scoping serves in a conventional notation has been replaced, in RDF
graph syntax, by the idea of bnode *identity*. If two chunks of RDF
have no bnodes in common, we can think of them as separate graphs
without changing the meaning of anything. In this case, the graph
formed by their union is equivalent to the conjunction of the two
graphs considered in isolation, so the graph boundaries are of little
importance, and do not affect any meanings: we can be careless about
where exactly the graph boundaries are. If they share a bnode, then
they are conceptually parts of a larger graph, their union. You *can*
think of a subgraph of a larger graph as being a graph in its own
right, but then the appropriate thing to do is to change the bnodes
in it which are shared with other parts of the graph, so as to keep
the bnode populations of the two distinct graphs separate: this leads
to the idea a merge rather than a union. There is no significance to
the particular choice of the *actual* bnodes, of course, since they
are blank; hence the convention that graph-equivalent graphs are
treated as effectively interchangeable. In effect, bnodes are like
interchangeable tokens with no particular significance in themselves,
other than being different from all the others. Again, think of blobs
in diagrams.
If two sets of RDF triples do not share a bnode, then one can think
of those bnodes as being like bound variables bound by quantifiers
just at the edge of the graphs. (Imagine that each bnode 'calls down'
its unique binding quantifier from the edge of the entire Web, to
surround this particular graph. That scope change doesn't affect any
meanings, since that quantifier doesn't bind anything else on the
entire Web, because bnodes are globally unique.) This allows the RDF
graph syntax to be treated very similarly to a conventional syntax,
almost all the time, and why one can often manage quite well without
bothering about the distinction between bnodes and bnodeIDs. One can
'think lexically', treating the bnodeID scopes as though they were
bnode scopes. But in fact, this is rather sloppy, and only works as
long as the ID scopes - which are determined by document conventions
rather than by the RDF specs - are naturally aligned with
co-occurrences of bnodes in the described graphs. They will be, as
long as there is no way for two distinct documents to describe graphs
with a bnode in common: and before SPARQL, there was no such way. The
only way to *refer to* a bnode at all is to use a bnodeID, and
bnodeID scopes are document-local; and although AFAIK this was
nowhere stated explicitly, there is a universally understood
assumption that distinctly described graphs do not 'accidentally'
share bnodes, e.g. if I write two Ntriples documents then the RDF
graphs they describe do not share a bnode. Under these circumstances,
then, document scoping can be thought of as working exactly like a
conventional logical notation in which RDF syntax *is* the document
syntax, the bnodeIDs are existential variables bound by implicit
quantifiers at the edge of documents, and bnodes are just a kind of
irrelevant ghost of these bound variables in the RDF abstract graph
space. But this is not in fact the picture described by the RDF
specs, which say that the graph syntax is normative: and this
simplified surface-syntax picture does not work for SPARQL, because
SPARQL, for the first time, treats bnodes in ways that cannot be
identified so straightforwardly with syntactic document scopes.
Now, to emphasize an obvious point: all this is about *syntax*, not
semantics. By saying that a bnode has global scope, Im not saying
that it doesn't have existential semantics: it does. The RDF specs
are quite clear on this point, and nothing in the above denies
anything in the specs. The only way that RDF is, er, weird, is in its
unconventional graph syntax, which trades lexical issues of scope and
the free/bound distinction for a simpler mathematical model of syntax
which can be defined purely set-theoretically and uses simple bnode
identity as its sole syntactic tool.
[More to come on RDF inference rules, leanness, told bnodes and
E-entailment; but I'll send this now to avoid further delay.]
Pat
--
---------------------------------------------------------------------
IHMC (850)434 8903 or (650)494 3973 home
40 South Alcaniz St. (850)202 4416 office
Pensacola (850)202 4440 fax
FL 32502 (850)291 0667 cell
phayesAT-SIGNihmc.us http://www.ihmc.us/users/phayes
Received on Sunday, 20 August 2006 19:46:59 UTC