- From: Pat Hayes <phayes@ihmc.us>
- Date: Sun, 20 Aug 2006 12:46:33 -0700
- To: RDF Data Access Working Group <public-rdf-dawg@w3.org>
I will try here to explain why 'told bnodes' aren't inconsistent with the RDF semantics (and RDF graph syntax), and why the RDF specs lead naturally to a view of bnodes as 'anonymous identifiers' (although I don't like that terminology) and why this isn't inconsistent with the RDF specs or the basic RDF model. This will also, hopefully, explain what I have been talking about in some of my exchanges with Bijan regarding existentials, names, quantifiers, bnodes having global scope and so on. This will take a while and require some discursive background, so please bear with me. First, it is important to distinguish between bnodes and bnodeIDs. In the RDF abstract graph syntax, there are no bnode IDs: bnodes are, literally, blank. Think of an RDF graph as a directed node-arc-node network with URIrefs and literals at the nodes and URIrefs on the arcs; then the blank nodes are, literally, blank nodes; nodes in this network that have *no label at all*. Those diagrams *are* the abstract syntax of RDF: but in order to lexicalize them in a character string, it is necessary to introduce some way to indicate when you are referring to the same blank node, so a document which lexicalizes an RDF graph has to use bnodeIDs (or something similar) to indicate this. But these IDs are not the bnodes themselves: in fact, they can be naturally thought of as *denoting* the bnodes, so that a lexicalized form for RDF (such as RDF/XML or Ntriples) is in fact in a sense a meta-syntactic description of some RDF graph structure. The reason this is more than just a curiosity, is that it means that bnodes are not lexical tokens. Since they have no lexical form, the type/token distinction is meaningless for bnodes. Bnodes are globally unique, and there are no relationships between them (other than identity). There cannot be two occurrences of the 'same' blank node in two disconnected graphs, as there can be two different occurrences (tokens) of the same letter (type) in two separate documents. This is an important way in which RDF graph syntax differs from conventional lexical syntax. In a conventional logical syntax we might have two expressions both using the same name: (P x) (Q x x) and then binding expressions are used to indicate which of these occurrences are considered to be 'the same variable' as others, which we conventionally refer to as being in the same scope, so that (exists (x)( and (P x)(Q x x))) makes all the tokens of 'x' into instances of the same variable, while (exists (x)(P x)) (exists (x)(Q x x)) makes different tokens into instances of distinct variables (which happen to have the same lexical form, but this is no longer of any logical importance.) All grade-school stuff, I know, but my point is that none of this makes sense for bnodes. That second expression would require *two distinct* bnodes to be encoded in RDF graph syntax. You cannot re-use the same bnode in two different graphs. Think about it: if you draw a node-arc diagram in one place and I draw a different diagram in another place, and they both have blank nodes in them, how can those possibly be the *same* blank node? Unless, that is, we are both using the same diagram, with one node in it, and both linking our arcs to it: but then if you are writing the RDF equivalent of (P x) - which might look like [blank node] ---rdf:type--->:P and I am writing the equivalent of (Q x x), which might look like this, excuse the ascii-art: [blank node] --- :Q --- /\ | |__________________| and if those really are the *same* bnode, then we have, together, written the equivalent of the first existential, with both atoms in the same binder scope: an RDF graph with a total of two (not three) nodes and two arcs. Now, the RDF concepts document describes the graph syntax using mathematical language rather than talking about these pictures, but the pictures are very much the driving intuition. (Hence the terminology of RDF "graph" and blank "node", although these are not, technically, graphs in the graph theory sense.) The mathematics says simply that bnodes are an infinite set of 'entities' which are different from URIrefs and literals. (Think of the set of anonymous blobs in all the possible node-arc-node diagrams that there ever could be. This set is global, to emphasize again: don't think of bnodes as being like words built up from a finite vocabulary, say, or as like anything constructed. They just *exist*.) Triples are 3-tuples of bnodes, literals and URIrefs, and then an RDF graph is a set of triples, yadda yadda, you know the rest. Any such set can obviously be written as a node-arc diagram (with suitable noises about infinite cases) and any such diagram defines a set of triples. But, notice that the spec does not say that *every* set of triples is an RDF graph. In fact, it doesn't actually say anywhere what counts as 'an RDF graph': it doesn't say where the 'boundary' of an RDF graph is supposed to be, and it provides no way to indicate where such boundaries lie. This is what I meant when I said to Bijan that RDF has no quantifiers. There is nothing in the RDF graph syntax which can be used to specify the edge or boundary of an RDF graph; no scoping conventions, as it were: no quantifiers, no brackets or parentheses. What makes a set of triples into an RDF graph is left implicit or unstated in the RDF specs; and this is a deliberate omission, because we explicitly did not want to create any such boundaries. Part of the very idea of RDF, the rationale for inventing it in the first place, was the idea of being able to gather together scraps of information from a variety of sources and use it together in one place. At some point in this process, you would decide that you had 'an RDF graph', and do something with it; but this RDF graph might not correspond to any document boundary or indeed any syntactically marked boundary at all: it might be defined by any number of criteria, and it might be highly temporary, or evanescent, or virtual. But, one might object, if RDF doesn't define where the boundaries of an RDF graph are, and yet the notion of 'RDF graph' is so central to the semantics, isn't the whole thing seriously underspecified? But in fact, the underspecificity is only apparent. To explain this I need to take a brief diversion into elementary logic, using conventional quantifier/bound-variable language. What are quantifiers for, after all? In full FO logic, they can be nested in all sorts of ways, but in RDF there are no universal quantifiers (and in RDFS, the 'implicit' universal quantifiers represented by such things as rdfs:subClass are all inside the scope of all the existentials, i.e. it is an EA logic) and so we know that any bnode is simply existentially quantified at the top level (not inside any other quantifiers); and under these circumstances, the only point of having quantifiers would be to indicate the scope of the particular variable, to be able to distinguish things like the two cases in the above example. But, note, this need arises *only if the same variable name is used in two different scopes*. Consider for a moment a variation on 'normal' logical notation in which every quantifier binds a distinct bound variable, unique to the quantifier, so that all bound variable names are 'standardized apart' simply as a matter of well-formedness of the syntax, and every variable in the text is bound by a unique quantifier. (This would be totally impractical with a conventional notation, of course, since there would be no way to know how to accidentally avoid re-using a name someone else had used without you knowing about it, not to mention that we would quickly run out of plausible variable names. But just imagine it, anyway. It is very similar to some ancient logical notations which use lines or circles rather than variable-name correspondences to indicate connections between quantifiers and what they bind.) Under this unique-variable-name condition, all quantifiers can be moved outwards without changing the meaning of the expression. That is, if x does not occur in Q and y does not occur in P, then (exist (x) P) & (exists (y) Q) is equivalent to (exists (x y)(P & Q)) and as long as we can be confident that no accidental 'capturing' of a name by a quantifier can occur, we can go on moving these quantifiers outwards as far as we like. In fact we can imagine them moved so far 'out' that their scope is the entire Web, at which point they become unnecessary, since all the quantifier scopes are the same, and all are global. Of course, they still conform to the existential semantics, so are not really *identifiers*, and this only works if we have some syntactic way to distinguish them from other names; but of course we do, in RDF. Since bnodes do have this syntactic uniqueness, by design, they can be thought of in this way. And this is exactly the way that the RDF model was intended to be used. Which is why there are no quantifiers in RDF. What it amounts to is that there is no notion of 'lexical scope' in RDF; the task (and it is essentially a syntactic task) that scoping serves in a conventional notation has been replaced, in RDF graph syntax, by the idea of bnode *identity*. If two chunks of RDF have no bnodes in common, we can think of them as separate graphs without changing the meaning of anything. In this case, the graph formed by their union is equivalent to the conjunction of the two graphs considered in isolation, so the graph boundaries are of little importance, and do not affect any meanings: we can be careless about where exactly the graph boundaries are. If they share a bnode, then they are conceptually parts of a larger graph, their union. You *can* think of a subgraph of a larger graph as being a graph in its own right, but then the appropriate thing to do is to change the bnodes in it which are shared with other parts of the graph, so as to keep the bnode populations of the two distinct graphs separate: this leads to the idea a merge rather than a union. There is no significance to the particular choice of the *actual* bnodes, of course, since they are blank; hence the convention that graph-equivalent graphs are treated as effectively interchangeable. In effect, bnodes are like interchangeable tokens with no particular significance in themselves, other than being different from all the others. Again, think of blobs in diagrams. If two sets of RDF triples do not share a bnode, then one can think of those bnodes as being like bound variables bound by quantifiers just at the edge of the graphs. (Imagine that each bnode 'calls down' its unique binding quantifier from the edge of the entire Web, to surround this particular graph. That scope change doesn't affect any meanings, since that quantifier doesn't bind anything else on the entire Web, because bnodes are globally unique.) This allows the RDF graph syntax to be treated very similarly to a conventional syntax, almost all the time, and why one can often manage quite well without bothering about the distinction between bnodes and bnodeIDs. One can 'think lexically', treating the bnodeID scopes as though they were bnode scopes. But in fact, this is rather sloppy, and only works as long as the ID scopes - which are determined by document conventions rather than by the RDF specs - are naturally aligned with co-occurrences of bnodes in the described graphs. They will be, as long as there is no way for two distinct documents to describe graphs with a bnode in common: and before SPARQL, there was no such way. The only way to *refer to* a bnode at all is to use a bnodeID, and bnodeID scopes are document-local; and although AFAIK this was nowhere stated explicitly, there is a universally understood assumption that distinctly described graphs do not 'accidentally' share bnodes, e.g. if I write two Ntriples documents then the RDF graphs they describe do not share a bnode. Under these circumstances, then, document scoping can be thought of as working exactly like a conventional logical notation in which RDF syntax *is* the document syntax, the bnodeIDs are existential variables bound by implicit quantifiers at the edge of documents, and bnodes are just a kind of irrelevant ghost of these bound variables in the RDF abstract graph space. But this is not in fact the picture described by the RDF specs, which say that the graph syntax is normative: and this simplified surface-syntax picture does not work for SPARQL, because SPARQL, for the first time, treats bnodes in ways that cannot be identified so straightforwardly with syntactic document scopes. Now, to emphasize an obvious point: all this is about *syntax*, not semantics. By saying that a bnode has global scope, Im not saying that it doesn't have existential semantics: it does. The RDF specs are quite clear on this point, and nothing in the above denies anything in the specs. The only way that RDF is, er, weird, is in its unconventional graph syntax, which trades lexical issues of scope and the free/bound distinction for a simpler mathematical model of syntax which can be defined purely set-theoretically and uses simple bnode identity as its sole syntactic tool. [More to come on RDF inference rules, leanness, told bnodes and E-entailment; but I'll send this now to avoid further delay.] Pat -- --------------------------------------------------------------------- IHMC (850)434 8903 or (650)494 3973 home 40 South Alcaniz St. (850)202 4416 office Pensacola (850)202 4440 fax FL 32502 (850)291 0667 cell phayesAT-SIGNihmc.us http://www.ihmc.us/users/phayes
Received on Sunday, 20 August 2006 19:46:59 UTC