scope review and an alternative phrasing. from Pat Hayes on 2013-03-15 (public-rdf-wg@w3.org from March 2013)

From: Pat Hayes <phayes@ihmc.us>
Date: Fri, 15 Mar 2013 13:36:28 -0500
To: Antoine Zimmermann <antoine.zimmermann@emse.fr>
Cc: RDF WG <public-rdf-wg@w3.org>, "Peter F. Patel-Schneider" <pfpschneider@gmail.com>
Message-Id: <060ADF83-C841-470F-A5DB-944E5D2BFD05@ihmc.us>

Let me try to sum up this whole discussion about bnodes and scopes, as it seems to be getting out of hand. At the end I give (yet another) way to say the stuff without actually using the sc-word.

The basic issue is really extremely simple. Suppose you have a bunch of pieces of RDF content from a variety of sources, and you want to put it all together into one piece of RDF. How can you do this?

If there are no bnodes involved, this is easy and obvious: you just form a union of all those pieces of RDF. What this means in practice is, you take all the RDF document fragments, each describing some set of RDF triples, and you combine them into one document or datastructure (such as a dataset) and treat that larger document/structure as a single RDF document or structure. Maybe you don't actually create it, but just treat your original sources as though they formed such a document or structure. Whatever: conceptually, you "form the union" of the RDF fragments, so in our nice clean abstract model of RDF, the graph syntax level, we treat this as the set-theoretic union of the sets of triples you started with.
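As a minimal sketch of the bnode-free case, assuming triples are modelled as plain tuples of strings (the names `source_a`, `source_b` and `union` are illustrative only, not from any RDF library):

```python
# Triples as (subject, predicate, object) tuples; no blank nodes involved.
source_a = {(":alice", ":knows", ":bob")}
source_b = {(":bob", ":knows", ":carol"), (":alice", ":knows", ":bob")}

def union(*graphs):
    """Set-theoretic union of any number of sets of triples."""
    result = set()
    for g in graphs:
        result |= g
    return result

# The triple that both sources state appears only once in the result.
combined = union(source_a, source_b)
```

Because every term here is a global name, a triple stated by both sources is the same triple, and nothing has to be said about where each fragment came from.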

But what if there are bnodeIDs in the pieces of RDF content? How do you know which bnodeIDs refer to the same bnode, and which do not? You can't assume, just because a given ID occurs in one piece of RDF and also in another, that it means the same bnode in each case, and we all know why: because if those RDF pieces came from different places, the bnodeID used in one place might mean a different bnode than the same bnodeID used in a different place. This is all familiar to anyone who has worked with local variables or local identifiers: when you take one of these out of its syntactic scope, you need some way to record that scope context in order to be able to know which occurrences of the local variable are "the same" as others. And this point is widely appreciated by RDF implementers, of course.

But we also can't just assume that they all mean different bnodes, because if, as a matter of fact, both of these RDF pieces came from one source - for example, if they are both fragments of some larger RDF document - then those two occurrences of the bnodeID would, in fact, mean the same bnode. We need to know which pieces of RDF content share a common scope for their bnodeIDs, in order to know which bnodeIDs to treat as meaning the same bnode, and which not.

How did the 2004 semantics deal with this? It didn't deal with it, it simply punted on the issue, by overloading the word "graph". In 2004 we just defined two distinct operations, union and merge (which treats all the RDF fragments as having different scopes, regardless of their source), and just hand-waved about which one to use. ("[Union] is appropriate when the information in the graphs comes from a single source, or where one is derived from the other by means of some valid inference process, as for example when applying an inference rule to add a triple to a graph. Merging... is appropriate when the graphs come from different sources and there is no justification for assuming that a blank node in one refers to the same entity as any blank node in the other." - 2004 Semantics, end of section 1.5)
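The difference between the two 2004 operations can be sketched as follows. This is only an illustration under stated assumptions: triples are tuples of strings, a term starting with "_:" is a bnodeID, and the helper names (`union`, `merge`, `bnode_ids`) are hypothetical rather than taken from any RDF toolkit.

```python
from itertools import count

def is_bnode_id(term):
    return term.startswith("_:")

def bnode_ids(graph):
    """All bnodeIDs that occur anywhere in the graph."""
    return {t for triple in graph for t in triple if is_bnode_id(t)}

def union(graphs):
    """Treat equal bnodeIDs as the same bnode (one shared scope)."""
    out = set()
    for g in graphs:
        out |= g
    return out

def _rename(term, renaming, fresh):
    """Map a bnodeID to a fresh one, consistently within one graph."""
    if not is_bnode_id(term):
        return term
    if term not in renaming:
        renaming[term] = f"_:m{next(fresh)}"
    return renaming[term]

def merge(graphs):
    """Standardize bnodeIDs apart (every graph its own scope), then union."""
    fresh = count()
    out = set()
    for g in graphs:
        renaming = {}  # a new renaming table for each input graph
        for s, p, o in g:
            out.add((_rename(s, renaming, fresh), p, _rename(o, renaming, fresh)))
    return out

g1 = {(":a", ":p", "_:x")}
g2 = {(":b", ":p", "_:x")}

same_scope = union([g1, g2])        # _:x is shared by both triples
different_scopes = merge([g1, g2])  # two distinct bnodeIDs
```

Neither function can tell, from the triples alone, which of the two results is the right one; that depends on whether `g1` and `g2` came from the same source, which is exactly the information the 2004 model leaves out.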

Hand-waving about something this basic is bad. But there was worse hand-waving in the 2004 specifications, about what exactly was meant by "RDF graph". It defined RDF graph to be a set of triples, but it later defined equivalence of graphs (1:1 mapping between the bnodes, everything else the same) and said that it will "treat such equivalent graphs as identical" (2004 semantics, section 0.3). But now, this is really sloppy. Obviously it can't really have meant *identical*, as if it did, then standardizing blank nodes apart would have literally made no change to the actual graph, so merging and unioning would have been the same operation. What it actually meant was, "we will often not bother to distinguish between equivalent graphs as far as the semantics is concerned" or some such mathematician's phrase to indicate that they are going to be lazy about real distinctions, assuming that the reader can fill in the exact details if they need to. But that isn't a fair assumption in this case, which is how we got into this mess, because sometimes these distinctions do matter.

If G is a graph, then any subset of the triples in G is a subgraph of G. And of course a subgraph is, itself, a graph. That sounds OK, but already there is a problem, because the 2004 semantics also uses this notion of "RDF graph" to define the scope of blank nodes, in the semantic conditions for blank nodes. And it says, intending to be helpful but in fact causing huge confusion, that you think of blank nodes like existential variables: "[the semantics] effectively treats all blank nodes as having the same meaning as existentially quantified variables in the RDF graph in which they occur, and which have the scope of the entire graph." (2004 section 1.5).

Let me take the stance of being a cynical reader of the 2004 specs, at this point. My question is, *which* entire graph is it talking about? Every subset of your graph is a graph, and that subset, considered as a graph, is "entire" inside itself. So do these truth conditions apply to every subgraph of every graph? And therefore, in the limit, to every triple considered in isolation? Obviously that isn't the intention, but the actual statement does not rule it out. It all turns on that one word "entire". I am supposed to consider the "entire" graph here: but, to return to our first scenario, how do I know what the "entire" graph is, in general? If all I have is bits of RDF gleaned from many sources, some of these might be fragments from a single "entire" graph, others might well not. So my old problem has re-appeared in the guise of how to specify what the "entire" graphs are, that my fragments have been selected from.

The 2004 formal statement of the semantics of bnodes is even less helpful, it just defines the truth-conditions on a graph in terms of the bnodes *in that graph*. This has the insane consequence that the meaning of a graph might not be the same as the combined meaning of the triples in the graph, in general, because if we apply this to the separate one-triple subgraphs of

:a :p _:x .
:b :p _:x .

and conjoin the results, we don't get the same meaning as if we apply it to the whole graph. But what is wrong with applying it to each subgraph in turn, if we have (as the 2004 spec does) defined them both to be graphs? That weasel-word "entire" is lurking in the wings to try to avoid this problem, and in practice of course it is usually assumed robustly by everyone concerned, but it's not in the formal specification, and there's no way, using the 2004 conceptual base, to put it in there.
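A small brute-force check makes the gap concrete. It is a deliberately simplified sketch, not the 2004 model theory: names are assumed to denote themselves, an "interpretation" is just a finite set of ground facts over a small domain, and the function name `satisfied` and the sample facts are invented for illustration.

```python
from itertools import product

def satisfied(graph, facts, domain):
    """True if some assignment of the graph's own blank nodes to domain
    elements turns every triple of the graph into a fact."""
    bnodes = sorted({t for triple in graph for t in triple if t.startswith("_:")})
    for values in product(domain, repeat=len(bnodes)):
        assignment = dict(zip(bnodes, values))
        if all(tuple(assignment.get(t, t) for t in triple) in facts
               for triple in graph):
            return True
    return False

graph = {(":a", ":p", "_:x"), (":b", ":p", "_:x")}

# A world in which :a and :b are each :p-related to something,
# but not to the same thing.
facts = {(":a", ":p", ":c"), (":b", ":p", ":d")}
domain = [":a", ":b", ":c", ":d"]

# Apply the truth conditions to the whole graph: one value for _:x.
whole = satisfied(graph, facts, domain)

# Apply them to each one-triple subgraph in turn and conjoin the results:
# _:x may take a different value in each subgraph.
piecewise = all(satisfied({t}, facts, domain) for t in graph)
```

In this world `whole` is false while `piecewise` is true, so the conjunction of the one-triple subgraphs says strictly less than the graph they were taken from.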

All of this is a real problem at the heart of the 2004 specifications. (And I feel rather strongly about it, because I was the one responsible for putting it there. So it is my mistake I am trying to fix here.) What is less widely appreciated, however, is that the 2004 RDF treatment of blank nodes also suffers from the opposite problem. Going back to our initial scenario of combining RDF fragments, suppose that we somehow know that all these fragments come from completely different, unrelated, sources, so any accidental re-use of a bnodeID in two of them is just a coincidence, and does not imply that these mean the same bnode. So, do we then know that these graphs share no blank nodes? It is widely assumed that we do, and RDF processors and RDF-savvy human beings operate on this assumption. But the 2004 specs do NOT say that this is true, and they even go to long lengths to work around the possibility that it could be false. That is, the 2004 mathematical model of bnodes in RDF graphs explicitly allows graphs from completely unrelated sources to "accidentally" share a blank node, even though this would not make any sense at all. So, not only do we not know whether the same bnodeID refers to the same blank node, we also don't know whether *different* bnodeIDs refer to the same blank node. Which, I submit, is crazy. It is so crazy that most RDF users don't even consider it as a possibility.

So, there *are* things wrong with the 2004 graph model. And these things that are wrong all have a common core source, which is that saying that an RDF graph is a set, and treating blank nodes as real things, gives the blank nodes an individual status which they should not have. A blank node is not a thing, with an independent existence: it is just a place in an abstract structure; and these abstract structures are abstractions of real documents and datastructures, and all the processing takes place at that 'surface' level where there are actual identifiers, and issues of identifier scope make sense.

This needs to be fixed.

The solution, which fixes all these problems and issues at a single stroke, is to put just enough extra structure into the conceptual mathematical model to make it clear how blank nodes are individuated between graphs. Intuitively, each actual RDF graph - what is currently being called a scoped graph - 'contains' its blank nodes, and these cannot be contained in a different scoped graph. A subgraph of a scoped graph might have the same scope, but scoped graphs can't overlap. That gives flesh to that crucial word "entire" used in the 2004 document. An "entire" graph is a scoped graph, a graph which fills up its scope. The 'origin' of a piece of RDF content - a graph fragment, a subgraph of a larger graph - is the scoped graph from which it came. Blank nodes live in a unique scope, so blank nodes from different scope(s)(d graphs) must be distinct.

If the idea of a "scope" is problematic, as it seems to be, we could define the idea entirely in terms of graphs, as follows.

Certain RDF graphs are "containing graphs". (Or some such term. They "contain" their bnodes.) We stipulate that every RDF graph is a subgraph of a containing graph, and two containing graphs cannot have blank nodes in common. The semantics for blank nodes are defined only for containing graphs. BnodeIDs are then local identifiers in (the documents or structures which describe) their containing graphs.
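One possible way to picture this, offered only as a sketch: identify each blank node by the pair of its containing graph and its local bnodeID. The `BNode` class, the `load` helper and the container names "doc1" and "doc2" are assumptions made for the example, not part of the proposal's wording.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BNode:
    container: str  # names the containing graph the bnode belongs to
    label: str      # the local bnodeID used in that graph's document

def load(container, triples):
    """Read triples of one containing graph, resolving "_:" labels locally."""
    def resolve(term):
        return BNode(container, term[2:]) if term.startswith("_:") else term
    return {tuple(resolve(t) for t in triple) for triple in triples}

def bnodes(graph):
    return {t for triple in graph for t in triple if isinstance(t, BNode)}

doc1 = load("doc1", [(":a", ":p", "_:x"), (":b", ":p", "_:x")])
doc2 = load("doc2", [(":c", ":p", "_:x")])

# Two subgraphs of the same containing graph share its blank node.
frag1 = {t for t in doc1 if t[0] == ":a"}
frag2 = {t for t in doc1 if t[0] == ":b"}
shared_within = bnodes(frag1) & bnodes(frag2)

# Two containing graphs have no blank nodes in common, even though
# both documents happen to use the bnodeID "x".
shared_across = bnodes(doc1) & bnodes(doc2)

# With bnodes individuated this way, plain set union is always safe.
combined = doc1 | doc2
```

Under this encoding the question "which entire graph?" has a definite answer for every fragment: it is whatever `container` its blank nodes carry.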

This allows phrasing such as saying of a subgraph that it is being "considered as a containing graph", which means that its bnodes are being considered to be local to it for the time being. So we can point out that the conjunction of two subgraphs considered as containing graphs, may not mean the same as the containing graph of which they are subsets. (But it does if the subgraphs are complete.)

We can then define the merge of a set S as a containing graph which is a union of equivalent graphs to all the graphs in S, with a 1:1 mapping from the blank nodes in S to those in the merge.
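A sketch of this definition, assuming IRIs are strings and each blank node is a `(container, label)` tuple naming its containing graph; the function name `merge` and the container name "merge" are illustrative choices only.

```python
from itertools import count

def merge(graphs, container="merge"):
    """Build a new containing graph from the given graphs.

    Returns the merged triples and the mapping from the original
    blank nodes to the blank nodes of the merge."""
    fresh = count()
    mapping = {}
    merged = set()
    for g in graphs:
        for triple in g:
            new_triple = []
            for term in triple:
                if isinstance(term, tuple):  # a blank node
                    if term not in mapping:
                        mapping[term] = (container, f"b{next(fresh)}")
                    term = mapping[term]
                new_triple.append(term)
            merged.add(tuple(new_triple))
    return merged, mapping

g1 = {(":a", ":p", ("doc1", "x"))}
g2 = {(":b", ":p", ("doc2", "x"))}
g3 = {(":c", ":p", ("doc1", "x"))}  # another fragment of doc1

merged, mapping = merge([g1, g2, g3])

# The mapping is 1:1: distinct original bnodes get distinct new ones.
one_to_one = len(set(mapping.values())) == len(mapping)
```

Here the two fragments of `doc1` keep their shared blank node in the merge, while the coincidental reuse of the label "x" in `doc2` yields a separate one, so no hand-waving about sources is needed at merge time.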

If y'all prefer, I could draft a version of this as a replacement for the 'scope graph' paragraphs in the current document. It needs language like, surface syntaxes MUST specify which graphs they describe are containing graphs, e.g. all the triples described by an RDF/XML document, all the triples described by a NTriples document, all the triples in a dataset.

Pat

------------------------------------------------------------
IHMC (850)434 8903 or (650)494 3973
40 South Alcaniz St. (850)202 4416 office
Pensacola (850)202 4440 fax
FL 32502 (850)291 0667 mobile
phayesAT-SIGNihmc.us http://www.ihmc.us/users/phayes

Received on Friday, 15 March 2013 18:37:02 UTC