Re: Scope of blank nodes in RDF from Sandro Hawke on 2012-09-07 (public-rdf-wg@w3.org from September 2012)

From: Sandro Hawke <sandro@w3.org>
Date: Thu, 06 Sep 2012 23:24:26 -0400
To: Richard Cyganiak <richard@cyganiak.de>
CC: public-rdf-wg@w3.org
Message-ID: <504968EA.3070504@w3.org>
summary: in current RDF systems, blank nodes are sometimes shared 
between graphs, and I don't see how we can reasonably change things to 
make it otherwise.   Given that, let's make blank node labels in TriG be 
document-scope.

more details:


On 09/06/2012 03:13 PM, Richard Cyganiak wrote:
> On 6 Sep 2012, at 17:30, Sandro Hawke wrote:
>> On 09/06/2012 10:02 AM, Richard Cyganiak wrote:
>>> Summary: In this message, I argue that:
>>>
>>> 1. Since RDF-WG is standardizing multigraphs and a notion of persistence for RDF data, we need to define the scope of blank nodes in the abstract syntax.
>> Ohhhh.     "the scope of blank nodes in the abstract syntax."    Interesting.
>>
>> I think we're crossing issues here, or something.     ISSUE-21 is about the scope of blank node *labels*.
> Right. But the proposal to resolve ISSUE-21 by saying that blank node labels in TriG and N-Quads have document scope only makes sense under the assumption that blank nodes can be shared between g-boxes. We have not yet formally defined what a g-box is. This ISSUE-21 proposal affects what we can say in our definition of g-boxes.

Right.  I'm starting from the assumption that blank nodes can be shared 
between g-boxes, because so many RDF APIs (including SPARQL) allow it.

So, my proposal is:

Proposal-1: We define blank node labels as having document-scope in 
TriG, settling an ambiguity in the original spec.  Documents and code 
written based on a different assumption about blank node label scope 
will have to be changed.

I see your counter-proposal:

Proposal-2: We change the abstract syntax of RDF such that blank nodes 
cannot be shared between graphs.

Even if this were desirable in terms of use cases (which I don't think 
it is), this seems like a non-starter because of the changes it would 
require in applications and libraries.    Even if SPARQL and Jena and 
rdflib, etc, etc, made this change, wouldn't it require changes in user 
apps?    Surely you're not suggesting that....  so I'm not sure what 
you're suggesting.       If we said blank nodes could no longer be 
shared between g-boxes/graphs, would you expect Jena and SPARQL to 
enforce that in some way?

For example, I don't think you'd say Jena MUST raise an exception when a 
user uses the same blank node in triples being added to two graphs.   
But, then, what are you suggesting the systems do differently?

>> It sounds like you're talking about the scope of blank nodes themselves, in acting as logic symbols.
> Yes.
>
>> If you are, that would be an RDF-wide issue, not a     GRAPHS issue.
> It is an RDF-wide issue *and* a GRAPHS issue.
>
>> Let's see if I can be very clear about the difference here.
>>
>> 1.  ISSUE-21 (the scope of blank node labels in TriG).
>>
>> In an RDF serialization, there are bindings from blank node labels to blank nodes.   (In RDF/XML, the blank node labels are called     nodeIDs).   These bindings are per-document in Turtle.  The spec says:
>> A fresh RDF blank node is allocated for each unique blank node label in a document. Repeated use of the same blank node label identifies the same RDF blank node.
>> ... so the scope of blank node labels in Turtle is the document.
> Yes.
>
>> I meant ISSUE-21 to be asking what is the scope of blank node labels in TriG.   The options are (0) leave it ambiguous, (1) document scope, (2) scope to the graph, (3) scope to the curly brackets.
>>
>> (Options 2 and 3 differ only in the case where triples in a named graph are split into different curly-bracket expressions, which we decided to allow.)
>>
>> I'm in favor of option (1) because it allows expressing arbitrary datasets without Skolemizing and de-Skolemizing.
> Option (1) only makes sense if blank nodes can be shared between g-boxes. Options (0), (2) and (3) are consistent with a view that blank nodes cannot be shared between g-boxes.

Sure.     But blank nodes can, today, be shared between g-boxes. So 
option (1) makes sense....
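
To make the difference between the options concrete, here is a minimal 
Python sketch of how a parser might allocate fresh blank nodes under 
document scope (option 1) versus graph scope (option 2).  The BlankNode 
class, the allocate() function and the tuple layout are all hypothetical, 
made up for illustration; no real TriG parser is being described.

```python
class BlankNode:
    """Opaque node: two nodes are the same only if they are the same object."""
    pass

def allocate(quads, scope):
    """Replace blank node labels with fresh BlankNode objects.

    quads: list of (graph_name, subject_label, predicate, object) tuples
    scope: "document" (option 1) or "graph" (option 2)
    """
    table = {}
    result = []
    for graph, label, pred, obj in quads:
        # Under document scope the label alone is the lookup key;
        # under graph scope the same label in another graph is a new key.
        key = label if scope == "document" else (graph, label)
        if key not in table:
            table[key] = BlankNode()
        result.append((graph, table[key], pred, obj))
    return result

# The same label _:x used inside two named graphs.
doc = [
    ("g1", "_:x", "ex:p", "ex:o"),
    ("g2", "_:x", "ex:p", "ex:o"),
]

doc_scoped = allocate(doc, "document")
graph_scoped = allocate(doc, "graph")

print(doc_scoped[0][1] is doc_scoped[1][1])      # True: g1 and g2 share one node
print(graph_scoped[0][1] is graph_scoped[1][1])  # False: one node per graph
```

Only the document-scope run can express a dataset in which g1 and g2 
share a blank node, which is the point of option (1).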


>> 2.  "the scope of blank nodes in the abstract syntax"
>>
>> I'm not sure this concept makes sense.
> Let me ask you a question.
>
> Can two g-boxes share a blank node?

Yes.   This is something I've been thinking about since implementing my 
first RDF store in 2001.   I'm not sure this was the right design for 
RDF, but many tools have ended up allowing sharing of blank nodes 
between g-boxes, and now an unknown number of applications have taken 
advantage of that.

> If you answer no, then obviously blank nodes have scope.

(an aside: I'd prefer if you'd avoid that use of the word "scope", since 
it has so much potential to be confused with the question of scoping 
blank node identifiers.   But now that I see how you're using it, I 
think I can follow what you're saying.)

> If you answer yes, then let me ask you another question.
>
> Can two graph stores share a blank node?

This is something I've been thinking about since this morning, so I'm 
not so sure.   But, working from first principles, I'd have to say: yes.

Blank nodes are an "internal" thing.   Standards don't provide any way 
to get at them (except maybe our Skolem IRIs), but if two graph stores 
are cooperating using some private agreement (like they are actually 
handled by one running process) then yes, they could share blank nodes.

For example, a process might be storing one user-editable graph and 
presenting several views on that graph, such as 
that-graph-plus-entailments for various entailment regimes.    Each of 
those graphs could be presented in various APIs.  Perhaps they would 
appear as named graphs in one dataset; or they could appear as the 
default graphs in different datasets; or they could appear as named 
graphs in different datasets.   Anyway, end result, it makes perfect 
sense to have one blank node exist in multiple graph stores.

> If you answer no, then obviously blank nodes have scope.
>
> If you answer yes, then please explain to me how I can determine whether your graph store and my graph store share a blank node or not.

As yet there is no standard way to ask whether they are the same blank 
node, when they are in different graph stores.  So you'd have to use a 
non-standard API or a non-standard extension to SPARQL. Perhaps some 
kind of UNION DATASET extension.

I'm not suggesting anyone should ever make such a thing.

It seems like an easy enough thing to do within any one system (eg just 
compare ObjectID or something).  To do it interoperably between systems 
I imagine we'd do something like our Skolem IRIs -- the one minting the 
"fresh blank node" generates a uuid-like identifier for it, and it can 
move to other systems taking that with it.  Of course, then it's 
questionable whether it's really "blank", but if we need to pretend it 
is for compatibility, we can.
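
As a rough sketch of that idea (not a proposal), the minting system could 
keep a table from its internal blank nodes to uuid-based identifiers.  
The SkolemTable class and the example.org base below are assumptions for 
illustration; the only detail taken from this thread is the 
"/.well-known/genid/" path segment.

```python
import uuid

# Hypothetical authority; only the /.well-known/genid/ segment comes from the thread.
GENID_BASE = "http://example.org/.well-known/genid/"

class BlankNode:
    """Opaque node, hashed and compared by object identity."""
    pass

class SkolemTable:
    """Keeps a stable mapping between local blank nodes and minted IRIs."""

    def __init__(self, base=GENID_BASE):
        self.base = base
        self._iri_by_node = {}
        self._node_by_iri = {}

    def skolemize(self, node):
        # Mint an identifier the first time a node is exported, then reuse it.
        iri = self._iri_by_node.get(node)
        if iri is None:
            iri = self.base + uuid.uuid4().hex
            self._iri_by_node[node] = iri
            self._node_by_iri[iri] = node
        return iri

    def deskolemize(self, iri):
        # Returns None for IRIs this table did not mint.
        return self._node_by_iri.get(iri)

table = SkolemTable()
n1, n2 = BlankNode(), BlankNode()
iri1 = table.skolemize(n1)

print(table.skolemize(n1) == iri1)    # True: the mapping is stable
print(table.skolemize(n2) == iri1)    # False: a different node gets a different IRI
print(table.deskolemize(iri1) is n1)  # True: the minting system can map back
```

Another system that receives iri1 can compare it as a plain string, which 
is exactly why it is questionable whether the node is still "blank".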

> If you can provide such an explanation, then you're right, we don't need to talk about the scope of blank nodes. I have not seen an explanation that works.

Blank nodes have a hard time moving between systems, because (by design) 
they are hard to talk about.   That's generally a good thing, I think.   
But in theory they can certainly exist in multiple g-snaps, g-boxes, 
datasets, graph stores, and even graph stores on multiple systems.

(At some point I'd like to understand why Oracle saw the need for stable 
blank node identifiers and didn't feel like they could just use IRIs.    
I think best practice is to use IRIs when you're going to want to use 
that graph node again in some later triples.)

> If you cannot provide such an explanation, then explain to me how this can be reconciled with the sentence in RDF 2004 and RDF 1.1 Concepts:
>
> [[
> Given two blank nodes, it is possible to determine whether or not they are the same.
> ]]
> http://www.w3.org/TR/rdf11-concepts/#section-blank-nodes

If one can be "given" the two nodes, then they must exist in the same 
system, and at that point it's easy to tell (eg compare ObjectID or some 
kind of internal ID).    Between systems, one is basically never given a 
blank node; instead one is instructed to create a fresh one (as you 
point out below).

>> But I understand the idea that in the abstract syntax IRIs act like logical constants.   We've had some discussion about whether a given IRI necessarily denotes the same thing everywhere or not.  That is, do IRIs have global scope, or some kind of smaller scope?
> This is different. That was about the question whether an IRI denotes the same resource wherever it occurs. It is about the semantics. It's not what I'm talking about. I'm talking about the abstract syntax.
>
> What is the scope of IRIs in the abstract syntax? RDF 1.1 Concepts says:
>
> [[
> IRI equality: Two IRIs are equal if and only if they are equivalent under Simple String Comparison according to section 5.1 of [IRI]. Further normalization must not be performed when comparing IRIs for equality.
> ]]
> http://www.w3.org/TR/rdf11-concepts/#section-IRIs
>
> So, two IRIs are equal or unequal regardless of where they occur. They are global in scope. Every RDF graph in the world that uses the IRI <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> does, in fact, use the same IRI.
>
> It is not so easy for blank nodes. Blank nodes in the abstract syntax have no identifier or any other kind of structure or properties that would allow us to tell whether two of them are the same by inspecting them.
>
> When we parse a Turtle document, then it tells us at what points we need to conjure a “fresh” blank node. A fresh blank node is different from every other blank node that already exists. That's what “fresh” means in this context. As long as we only talk about the static RDF graph that results from the parsing of the single document, we know which blank nodes are the same because the Turtle spec spells out how the graph is constructed from “fresh” blank nodes. So everything is fine.
>
> Now let's talk about g-boxes. Take one and call it A. G-boxes have persistence. I can put a blank node into it, and the blank node stays there, right? The next day, it will still contain the same blank node, right? If I copy the contents of the g-box A into a different g-box B the next day, then both g-boxes contain the same blank node, right? And if I copy the contents of A into yet another g-box C the day after, then B and C contain the same blank node, right? This is the status quo with SPARQL Update, assuming that the slots of a graph store are g-boxes.
>
> I can postulate the existence of two g-boxes, one sitting in Ireland and one sitting in New Zealand, that share a blank node. Both hold an RDF graph containing a blank node. Is it the same blank node or not? I think the specs ought to answer that question. And I think they don't at the moment.

They're not very clear about it, true.   I think they hold the same 
blank node if they are constructed to do so, and don't if they are 
not.    In general, they are not, but some kind of federated RDF store 
could in theory be constructed where they were the same.   I don't 
advocate doing that -- I think it's best to keep blank nodes short lived 
and local.

(On the other hand, getting back to ISSUE-21, I do think it's important 
to be able to tell someone to construct a dataset using fresh blank 
nodes in certain places, creating a certain graph topology.     We can 
return to the use cases for that if necessary, but that seems less 
important right now.)

> (My preferred answer is: “You put a blank node from one g-box into another one, you get a new blank node.

That's also a common way to do things.  If you serialize & deserialize 
that's what you'll get.  If you want it to be the same blank node, you 
have to do some kind of internal copy or reference sharing.
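
A small Python sketch of the two kinds of copy, using toy serialize() and 
parse() functions that are hypothetical stand-ins for a real serializer 
and parser (terms are plain strings, and anything starting with "_:" is 
treated as a blank node label):

```python
class BlankNode:
    """Opaque node, hashed and compared by object identity."""
    pass

def serialize(graph):
    """Turn a set of triples into rows of strings, labelling blank nodes."""
    labels = {}
    rows = []
    for triple in graph:
        row = []
        for term in triple:
            if isinstance(term, BlankNode):
                if term not in labels:
                    labels[term] = "_:b%d" % len(labels)
                term = labels[term]
            row.append(term)
        rows.append(tuple(row))
    return rows

def parse(rows):
    """Rebuild a graph, allocating a fresh BlankNode for each label."""
    nodes = {}
    graph = set()
    for row in rows:
        triple = []
        for term in row:
            if term.startswith("_:"):
                if term not in nodes:
                    nodes[term] = BlankNode()
                term = nodes[term]
            triple.append(term)
        graph.add(tuple(triple))
    return graph

def blank_nodes(graph):
    return {t for triple in graph for t in triple if isinstance(t, BlankNode)}

n = BlankNode()
g_a = {(n, "ex:p", "ex:o")}
g_shared = set(g_a)                   # internal copy: the same node object
g_roundtrip = parse(serialize(g_a))   # serialize & deserialize: a fresh node

print(blank_nodes(g_a) == blank_nodes(g_shared))     # True
print(blank_nodes(g_a) & blank_nodes(g_roundtrip))   # set()
```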

>   Therefore, different g-boxes contain different blank nodes.” R2RML was written with the assumption that this is how it works.

With the assumption that (1) copying in R2RML creates fresh blank nodes, 
or (2) that there is no way to re-use blank nodes in different g-boxes?

I hope it's just (1), in which case that doesn't seem like it would be a 
problem.    (I don't know much about R2RML, though.)

> Unfortunately, SPARQL Update disagrees with this, as I have learned only recently. Another possible answer is: “Different graph stores contain different blank nodes.” That reduces the problem to the question whether two given graph stores are the same or not, and that's a question that is fairly unlikely to cause problems.)
>
>> So, in the same sense, blank nodes could have this kind of scope.  Maybe a given blank node could denote one thing in one situation or context and a different thing in a different situation or context.       I don't like this idea -- I think IRIs should have global scope (although I see some appeal to bending that rule), and I think blank nodes should definitely have global scope.   Since blank nodes tend to be very local, I don't see any pressure to reuse one blank node with a different meaning, to let it have another scope.
> It's not about denotation. It's only about the abstract syntax. How can I tell whether the blank node in your graph store is the same or different from the blank node in my graph store? RDF Concepts requires that we be able to.
>
>> a few more comments in-line below, although I can't say much until we sort out the above....
>>
>>> 2. SPARQL Update should already have defined the scope of blank nodes for graph stores, and in fact is in conflict with some wording in RDF Concepts because it didn't.
>>> 3. The proposed resolution on sharing blank node labels across graphs in TriG closes the door to the simplest and most obvious way of fixing the scope of blank nodes.
>>> 4. I propose a different way of fixing the scope of blank nodes. This proposal is (I believe) compatible with SPARQL Update as it stands, should resolve the conflict between RDF Concepts and SPARQL Update, and allows sharing of bnode labels in TriG.
>>>
>>> This got a bit long; sorry for that.
>>>
>>>
>>>
>>> RDF Concepts, both in the 2004 and 1.1 versions, contains the following normative sentence:
>>>
>>> [[
>>> Given two blank nodes, it is possible to determine whether or not they are the same.
>>> ]]
>>>
>>> This is a constraint on the RDF data model, and hence on any other spec that uses RDF.
>>>
>>> Before SPARQL Update, it was easy to see that all the RDF-related W3C specs meet this constraint. No spec had any notion of persistence. RDF documents, RDF graphs and RDF datasets can all be seen as static snapshots. Any blank nodes mentioned are distinct from those mentioned in any other static snapshot.
>> Yes, before SPARQL update there was no W3C standard way to interact with a blank node outside the document used to create it.
> Yes.
>
>> But people have created ways; lots of APIs do it, and in the telecon, Souri and Zhe reported that Oracle decided to provide a syntactic mechanism as well (using stable blank node labels).
> Yes. They did this in the absence of a W3C standard. We've now reached a point where the lack of an official account is actually leading to different interpretations among different W3C Recommendations. The R2RML WG has shared the belief that “blank nodes cannot be shared between graphs in a SPARQL/RDF dataset” since 2010.

That's unfortunate.    :-(       Do you know ways the design of R2RML 
might have been different without this belief?
> I have now learned that SPARQL Update is designed around the contrary assumption. I accept that we probably need to consider SPARQL Update correct, and R2RML incorrect; but I think that RDF-WG should normatively settle the question or else we will keep getting funny problems.

Agreed, we should settle this.

I think two things are important:

- not breaking apps written to use blank nodes shared between graphs

- providing some decent mechanism for annotating graphs, eg for 
provenance, even when the graphs contain blank nodes.

The rest is probably a matter of taste.

>> I'm not sure whether Skolem IRIs will be another way to do this or not; it kind of depends how they end up being used.    If systems maintain long term stable mappings between the generated IRIs and internal blank nodes, then that will be another way to interact with blank nodes.    (This seems like a bad practice to me, so far, but I won't be too surprised if someone ends up finding it very useful.)
> These “long term stable mappings” will usually consist of appending the blank node's internal implementation-dependent ID to some sort of base URI that involves “/.well-known/genid/”.
>
>>> In SPARQL Update, we now have persistent blank nodes. I believe that Graph Stores as defined in SPARQL Update do not meet the normative constraint above.
>>>
>>> Thought experiment: I have a graph store. It lives on a disk somewhere. I make a copy of that disk, ship the copy around the world, and start it up. Now we have two graph stores with two different sets of endpoints. Do they still contain the same blank nodes or not?
>>>
>> Tricky question.    Similarly, what if you ship the original disk?   Or what if you just turn off the system and turn it back on?
>>
>> I think we need to focus on observable system behaviors.
> We also need to focus on the constraints that we put (or don't put) on existing and future specifications that use the “RDF dataset”, “graph store” and “g-box” concepts.
>
>> In these cases, I don't think there's any way to ask a system if they are the same blank node, so it doesn't matter.
> I've already said how this affects observable system behaviours.
>
>> (If it's maintaining a stable Skolem mapping, then it would matter -- but then it's barely a blank node any more....)
>>
>>> The normative sentence above means that the SPARQL Update spec (or RDF Concepts, if we put the definition there) needs to somehow give an answer to this question.
>>>
>>> Does the answer matter? Yes, because we want to do things like federating multiple graph stores into one graph store, and I can ask SPARQL queries where it matters whether these blank nodes from different graph stores are considered the same or not. So to implement such a federation engine, we need an answer.
>>>
>> I don't think the existing SPARQL syntaxes/protocols provide any way to get at this distinction, and I think that's probably good.
> That's not terribly relevant. We're defining an abstract syntax. Many query languages, dump formats and protocols are possible over that abstract syntax. Expecting the spec to answer how to merge two RDF datasets or two graph stores is certainly not unreasonable, as some future specs will probably need to merge datasets. And here, the question whether they can share blank nodes matters. It's the same as with RDF graphs, where we need to distinguish between “merge” and “union”, because of potential shared blank nodes.
>
> Do you expect that we define “RDF dataset merge” and “RDF dataset union”?

In my draft, I did:

    We define the union and merge of quadsets (and thus datasets) as the
    set merge of their constituent triples and quads; in the case of a
    merge, it is after any shared blank nodes have been renamed apart.

    http://dvcs.w3.org/hg/rdf/raw-file/default/rdf-spaces/index.html#merge-and-union
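
Read literally, that definition could be sketched in Python as below.  
This is only an illustration under assumptions: quads are modelled as 
(subject, predicate, object, graph_name) tuples, blank nodes as opaque 
objects, and the function names are made up rather than taken from the 
draft.

```python
class BlankNode:
    """Opaque node, hashed and compared by object identity."""
    pass

def blank_nodes(quadset):
    return {t for quad in quadset for t in quad if isinstance(t, BlankNode)}

def union(a, b):
    """Plain set union: shared blank nodes stay shared."""
    return a | b

def merge(a, b):
    """Rename apart the blank nodes b shares with a, then take the set union."""
    shared = blank_nodes(a) & blank_nodes(b)
    fresh = {node: BlankNode() for node in shared}
    renamed = {tuple(fresh.get(t, t) for t in quad) for quad in b}
    return a | renamed

n = BlankNode()
a = {(n, "ex:p", "ex:o", "g1")}
b = {(n, "ex:q", "ex:o", "g2")}

print(len(blank_nodes(union(a, b))))  # 1: both quads still use the node n
print(len(blank_nodes(merge(a, b))))  # 2: b's copy of n was replaced by a fresh node
```

The two operations only differ when the inputs actually share a blank 
node, which is why the sharing question matters for defining them.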

> How do you merge/union two graph stores?

First tell me what it means to union two g-boxes, then I think I can 
answer that.

>> To put it differently, SPARQL doesn't provide any way to move a blank node from one endpoint to a different one.    They are opaque     and trapped within processes.
> There can be multiple endpoints over the same graph store. It will be very common to have various views onto the same graph store with different permissions and the like.
>
>>> It appears to me that SPARQL Update does not give an answer.
>>>
>>> My preferred approach to this issue would have been to adopt the axiom that blank nodes are scoped to a g-box, and hence different g-boxes contain different blank nodes; and then work out the consequences from that axiom.
>> How could blank nodes be "scoped" to g-boxes?   You mean if the same blank node occurs in two g-boxes (like the same variable name     occurring in two scopes in a program) it denotes something different?
> No, I meant what I said: Different g-boxes contain different blank nodes. It is, by definition, not possible to have the same blank node in two g-boxes.

(This was just a place where I was confused about what you were saying 
because of the different meanings of the word "scope" relating to blank 
nodes.)
>> That seems like a very bad idea.
> Certainly.
>
>> Or do you just mean blank nodes are forbidden from occurring in multiple g-boxes?
> Yes.
>
>> But that would break lots of deployed systems (eg 4-store, with its union-default graph).
> How much it actually breaks depends on how many stores actually have managed to get the same blank node into multiple graphs. It's not that easy!

It's pretty easy in most of the RDF APIs I know.   You do something like:

   n = BlankNode()
   g1.add(n,a,b);
   g2.add(n,a,b);

In others, it's impossible, because you have to do:

   n1 = g1.BlankNode()
   n2 = g2.BlankNode()
   g1.add(n1,a,b);
   g2.add(n2,a,b);

... and if you try to put n1 in g2, it's an error.

Actually, I don't think I've seen the second kind of API, although it 
seems reasonable enough.
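
For concreteness, here is a runnable Python sketch of both styles.  The 
classes are toy stand-ins I made up for illustration (they are not the 
API of Jena, rdflib or any other library), and raising ValueError is just 
one way the second style might signal the error.

```python
class BlankNode:
    def __init__(self, owner=None):
        self.owner = owner   # None: the node is not tied to any graph

class Graph:
    """First style: any blank node can be added to any graph."""
    def __init__(self):
        self.triples = set()

    def add(self, s, p, o):
        self.triples.add((s, p, o))

class OwningGraph(Graph):
    """Second style: blank nodes are minted by, and confined to, one graph."""
    def blank_node(self):
        return BlankNode(owner=self)

    def add(self, s, p, o):
        for term in (s, p, o):
            if isinstance(term, BlankNode) and term.owner is not self:
                raise ValueError("blank node belongs to a different graph")
        super().add(s, p, o)

# First style: the same node ends up in both graphs.
n = BlankNode()
g1, g2 = Graph(), Graph()
g1.add(n, "ex:a", "ex:b")
g2.add(n, "ex:a", "ex:b")
shared = g1.triples & g2.triples
print(len(shared))   # 1: the very same triple, with the same node, is in both

# Second style: putting n1 into h2 is rejected.
h1, h2 = OwningGraph(), OwningGraph()
n1 = h1.blank_node()
h1.add(n1, "ex:a", "ex:b")
try:
    h2.add(n1, "ex:a", "ex:b")
    rejected = False
except ValueError:
    rejected = True
print(rejected)      # True
```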

>   And it might be possible to explain this issue away with skolem IRIs.

Yes.   I'm afraid that will actually be more confusing/complex, but I'm 
willing to explore that path a bit.

>>> SPARQL Update has already thrown a big wrench into the gears here by allowing blank nodes to be copied between graphs; but perhaps this problem could have still been explained away.
>>>
>>> But allowing blank nodes to be shared between graphs in TriG and N-Quads would definitely kill that approach. This is why I have opposed this sharing of blank nodes in yesterday's call.
>>>
>>>
>>>
>>> Now, another approach might be to adopt a different axiom:
>>>
>>> [[
>>> PROPOSAL: Two different graph stores can never share a blank node. Even if both graph stores are based on the same data (e.g., one is a copy or subset or view of the other), their blank nodes are, by definition, disjoint.
>>> ]]
>>>
>> I like that idea, but I don't think there is even a crisp notion of "different graph stores", so that might not work.
> Well, the definition has to be just crisp enough to make it unlikely that two reasonable individuals end up answering the question “are these two graph stores the same?” differently. That's not a very high bar.

If we have a master/slave replication setup, are the two graph stores 
the same?

(probably yes?)

If we have a master/delayed-mirror replication setup, are the two graph 
stores the same?

(probably no?)

Still a pretty high bar, I think.

Anyway, this email has gone on way too long.

        -- Sandro

>>> This should answer the question of blank node scope in the following way:
>>>
>>> 1. Within any concrete RDF document (TriG, Turtle, SPARQL results, etc.), blank nodes are scoped to that document, and the document syntax defines the rules that say whether two blank nodes are the same or not.
>> Sounds good, assuming you mean "blank node *labels* are scoped to that document".
> Each blank node label in a Turtle document represents a “fresh” blank node. This means that none of the represented blank nodes are shared with anything outside of the document. Hence the blank nodes are scoped to the document.
>
> (The blank node *labels* are scoped to the document too.)
>
>> If you want to conflate blank nodes and blank node labels, I want to see some proposed text changes for the Turtle document.
> But that's what Turtle already says.
>
>>> 2. Within any persistent graph store, blank nodes are scoped to the graph store.
>> Again, I don't have any idea what you mean by "scoped" here.
> Whether the blank node can be shared with the rest of the world outside of the graph store or not.
>
> Best,
> Richard
>
>
>
>>> 3. The abstract mathematical structures (RDF graphs, RDF datasets, SPARQL result sequences) are always either the result of parsing a concrete document, or are a static snapshot of a persistent graph store (or part thereof), and their scope is the document or persistent store.
>>>
>> That sounds okay.
>>
>>      - s
>>
>>>
>>> Thoughts?
>>>
>>> Best,
>>> Richard
>>>
>>>
>
Received on Friday, 7 September 2012 03:24:36 UTC