Re: Scope of blank nodes in RDF from Richard Cyganiak on 2012-09-06 (public-rdf-wg@w3.org from September 2012)

From: Richard Cyganiak <richard@cyganiak.de>
Date: Thu, 6 Sep 2012 20:13:42 +0100
To: Sandro Hawke <sandro@w3.org>
Cc: public-rdf-wg@w3.org
Message-Id: <16F2F8D6-256C-4D28-816D-6BD6DAB53015@cyganiak.de>
On 6 Sep 2012, at 17:30, Sandro Hawke wrote:
> On 09/06/2012 10:02 AM, Richard Cyganiak wrote:
>> Summary: In this message, I argue that:
>> 
>> 1. Since RDF-WG is standardizing multigraphs and a notion of persistence for RDF data, we need to define the scope of blank nodes in the abstract syntax.
> 
> Ohhhh. "the scope of blank nodes in the abstract syntax." Interesting.
> 
> I think we're crossing issues here, or something. ISSUE-21 is about the scope of blank node *labels*.

Right. But the proposal to resolve ISSUE-21 by saying that blank node labels in TriG and N-Quads have document scope only makes sense under the assumption that blank nodes can be shared between g-boxes. We have not yet formally defined what a g-box is. This ISSUE-21 proposal affects what we can say in our definition of g-boxes.

> It sounds like you're talking about the scope of blank nodes themselves, in acting as logic symbols.  

Yes.

> If you are, that would be an RDF-wide issue, not a GRAPHS issue.

It is an RDF-wide issue *and* a GRAPHS issue.

> Let's see if I can be very clear about the difference here.
> 
> 1.  ISSUE-21 (the scope of blank node labels in TriG).
> 
> In an RDF serialization, there are bindings from blank node labels to blank nodes. (In RDF/XML, the blank node labels are called nodeIDs). These bindings are per-document in Turtle. The spec says:
> A fresh RDF blank node is allocated for each unique blank node label in a document. Repeated use of the same blank node label identifies the same RDF blank node.
> ... so the scope of blank node labels in Turtle is the document.  

Yes.

> I meant ISSUE-21 to be asking what is the scope of blank node labels in TriG.   The options are (0) leave it ambiguous, (1) document scope, (2) scope to the graph, (3) scope to the curly brackets.    
> 
> (Options 2 and 3 differ only in the case where triples in a named graph are split into different curly-bracket expressions, which we decided to allow.)
> 
> I'm in favor of option (1) because it allows expressing arbitrary datasets without Skolemizing and de-Skolemizing.

Option (1) only makes sense if blank nodes can be shared between g-boxes. Options (0), (2) and (3) are consistent with a view that blank nodes cannot be shared between g-boxes.

> 2.  "the scope of blank nodes in the abstract syntax" 
> 
> I'm not sure this concept makes sense.  

Let me ask you a question.

Can two g-boxes share a blank node?

If you answer no, then obviously blank nodes have scope.

If you answer yes, then let me ask you another question.

Can two graph stores share a blank node?

If you answer no, then obviously blank nodes have scope.

If you answer yes, then please explain to me how I can determine whether your graph store and my graph store share a blank node or not.

If you can provide such an explanation, then you're right, we don't need to talk about the scope of blank nodes. I have not seen an explanation that works.

If you cannot provide such an explanation, then explain to me how this can be reconciled with the sentence in RDF 2004 and RDF 1.1 Concepts:

[[
Given two blank nodes, it is possible to determine whether or not they are the same.
]]
http://www.w3.org/TR/rdf11-concepts/#section-blank-nodes

> But I understand the idea that in the abstract syntax IRIs act like logical constants.   We've had some discussion about whether a given IRI necessarily denotes the same thing everywhere or not.  That is, do IRIs have global scope, or some kind of smaller scope?

This is different. That was about the question whether an IRI denotes the same resource wherever it occurs. It is about the semantics. It's not what I'm talking about. I'm talking about the abstract syntax.

What is the scope of IRIs in the abstract syntax? RDF 1.1 Concepts says:

[[
IRI equality: Two IRIs are equal if and only if they are equivalent under Simple String Comparison according to section 5.1 of [IRI]. Further normalization must not be performed when comparing IRIs for equality.
]]
http://www.w3.org/TR/rdf11-concepts/#section-IRIs

So, two IRIs are equal or unequal regardless of where they occur. They are global in scope. Every RDF graph in the world that uses the IRI <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> does, in fact, use the same IRI.
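
A minimal Python sketch of what this comparison rule implies. The function name `iri_equal` and the example.org IRIs are illustrative assumptions, not taken from any spec or library.

```python
def iri_equal(a: str, b: str) -> bool:
    """Simple String Comparison: character by character, no normalization."""
    return a == b

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# The same IRI string is the same IRI, no matter which graph it occurs in.
same = iri_equal(RDF_TYPE, "http://www.w3.org/1999/02/22-rdf-syntax-ns#type")

# No case normalization is applied, so these count as two different IRIs.
different = iri_equal("http://example.org/a", "http://EXAMPLE.org/a")

print(same, different)
```

Nothing about the surrounding graph, document or store enters the comparison, which is the sense in which IRIs are global in scope.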

It is not so easy for blank nodes. Blank nodes in the abstract syntax have no identifier or any other kind of structure or properties that would allow us to tell whether two of them are the same by inspecting them.

When we parse a Turtle document, then it tells us at what points we need to conjure a “fresh” blank node. A fresh blank node is different from every other blank node that already exists. That's what “fresh” means in this context. As long as we only talk about the static RDF graph that results from the parsing of the single document, we know which blank nodes are the same because the Turtle spec spells out how the graph is constructed from “fresh” blank nodes. So everything is fine.
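
As a rough sketch of this, assuming a hypothetical `parse_labels` function that stands in for the label-handling part of a Turtle parser (it is not a real parser, and the class and function names are made up for illustration):

```python
class BlankNode:
    """A blank node: no label, no structure, only its own identity."""


def parse_labels(labels):
    """Map each unique blank node label of ONE document to a fresh blank node."""
    table = {}
    for label in labels:
        if label not in table:
            # "Fresh" means: distinct from every blank node that already exists.
            table[label] = BlankNode()
    return table


doc1 = parse_labels(["_:a", "_:b", "_:a"])
doc2 = parse_labels(["_:a"])

# Within one document, the repeated label "_:a" identifies a single blank node.
nodes_in_doc1 = len(doc1)

# Across documents, the same label yields two different blank nodes.
shared_across_docs = doc1["_:a"] is doc2["_:a"]

print(nodes_in_doc1, shared_across_docs)
```

The sketch only covers the static case: each call models one parse of one document, and nothing in it says what happens to a node once it is stored somewhere persistent.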

Now let's talk about g-boxes. Take one and call it A. G-boxes have persistence. I can put a blank node into it, and the blank node stays there, right? The next day, it will still contain the same blank node, right? If I copy the contents of the g-box A into a different g-box B the next day, then both g-boxes contain the same blank node, right? And if I copy the contents of A into yet another g-box C the day after, then B and C contain the same blank node, right? This is the status quo with SPARQL Update, assuming that the slots of a graph store are g-boxes.

I can postulate the existence of two g-boxes, one sitting in Ireland and one sitting in New Zealand, that share a blank node. Both hold an RDF graph containing a blank node. Is it the same blank node or not? I think the specs ought to answer that question. And I think they don't at the moment.

(My preferred answer is: “You put a blank node from one g-box into another one, you get a new blank node. Therefore, different g-boxes contain different blank nodes.” R2RML was written with the assumption that this is how it works. Unfortunately, SPARQL Update disagrees with this, as I have learned only recently. Another possible answer is: “Different graph stores contain different blank nodes.” That reduces the problem to the question whether two given graph stores are the same or not, and that's a question that is fairly unlikely to cause problems.)
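
The two readings can be contrasted in a small Python sketch. Everything here is an assumption made for illustration: `GBox` is a hypothetical mutable container of triples, `copy_sharing` models the reading attributed to SPARQL Update above, and `copy_fresh` models the "you get a new blank node" reading.

```python
class BlankNode:
    """A blank node, compared and hashed by object identity."""


class GBox:
    """Hypothetical g-box: a mutable container holding a set of triples."""

    def __init__(self):
        self.triples = set()


def copy_sharing(src, dst):
    """Copy triples as they are, so src and dst end up sharing blank nodes."""
    dst.triples |= src.triples


def copy_fresh(src, dst):
    """Copy triples, replacing every blank node with a new one in dst."""
    renamed = {}

    def fresh(term):
        if isinstance(term, BlankNode):
            if term not in renamed:
                renamed[term] = BlankNode()
            return renamed[term]
        return term

    dst.triples |= {tuple(fresh(t) for t in triple) for triple in src.triples}


def blank_nodes(box):
    return {t for triple in box.triples for t in triple if isinstance(t, BlankNode)}


a, b, c, d = GBox(), GBox(), GBox(), GBox()
a.triples.add((BlankNode(), "http://example.org/p", "http://example.org/o"))

copy_sharing(a, b)
copy_sharing(a, c)
shared_b_c = blank_nodes(b) & blank_nodes(c)  # B and C share A's blank node

copy_fresh(a, d)
shared_a_d = blank_nodes(a) & blank_nodes(d)  # A and D share nothing

print(len(shared_b_c), len(shared_a_d))
```

Under `copy_fresh`, the question whether two g-boxes share a blank node always has the answer "no", which is what makes that axiom attractive.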

> So, in the same sense, blank nodes could have this kind of scope. Maybe a given blank node could denote one thing in one situation or context and a different thing in a different situation or context. I don't like this idea -- I think IRIs should have global scope (although I see some appeal to bending that rule), and I think blank nodes should definitely have global scope. Since blank nodes tend to be very local, I don't see any pressure to reuse one blank node with a different meaning, to let it have another scope.

It's not about denotation. It's only about the abstract syntax. How can I tell whether the blank node in your graph store is the same or different from the blank node in my graph store? RDF Concepts requires that we be able to.

> a few more comments in-line below, although I can't say much until we sort out the above....
> 
>> 2. SPARQL Update should already have defined the scope of blank nodes for graph stores, and in fact is in conflict with some wording in RDF Concepts because it didn't.
>> 3. The proposed resolution on sharing blank node labels across graphs in TriG closes the door to the simplest and most obvious way of fixing the scope of blank nodes.
>> 4. I propose a different way of fixing the scope of blank nodes. This proposal is (I believe) compatible with SPARQL Update as it stands, should resolve the conflict between RDF Concepts and SPARQL Update, and allows sharing of bnode labels in TriG.
>> 
>> This got a bit long; sorry for that.
>> 
>> 
>> 
>> RDF Concepts, both in the 2004 and 1.1 versions, contains the following normative sentence:
>> 
>> [[
>> Given two blank nodes, it is possible to determine whether or not they are the same.
>> ]]
>> 
>> This is a constraint on the RDF data model, and hence on any other spec that uses RDF.
>> 
>> Before SPARQL Update, it was easy to see that all the RDF-related W3C specs meet this constraint. No spec had any notion of persistence. RDF documents, RDF graphs and RDF datasets can all be seen as static snapshots. Any blank nodes mentioned are distinct from those mentioned in any other static snapshot.
> 
> Yes, before SPARQL update there was no W3C standard way to interact with a blank node outside the document used to create it.    

Yes.

> But people have created ways; lots of APIs do it, and in the telecon, Souri and Zhe reported that Oracle decided to provide a syntactic mechanism as well (using stable blank node labels).

Yes. They did this in the absence of a W3C standard. We've now reached a point where the lack of an official account is actually leading to different interpretations among different W3C Recommendations. The R2RML WG has shared the belief that “blank nodes cannot be shared between graphs in a SPARQL/RDF dataset” since 2010. I have now learned that SPARQL Update is designed around the contrary assumption. I accept that we probably need to consider SPARQL Update correct, and R2RML incorrect; but I think that RDF-WG should normatively settle the question or else we will keep getting funny problems.

> I'm not sure whether Skolem IRIs will be another way to do this or not; it kind of depends how they end up being used. If systems maintain long term stable mappings between the generated IRIs and internal blank nodes, then that will be another way to interact with blank nodes. (This seems like a bad practice to me, so far, but I won't be too surprised if someone ends up finding it very useful.)

These “long term stable mappings” will usually consist of appending the blank node's internal implementation-dependent ID to some sort of base URI that involves “/.well-known/genid/”.
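
A minimal sketch of such a mapping, assuming a hypothetical base URI `http://example.org` and a made-up internal ID `b42`; the function names are illustrative only.

```python
GENID_PATH = "/.well-known/genid/"


def skolemize(base: str, internal_id: str) -> str:
    """Build a Skolem IRI by appending the store's internal ID to the base."""
    return base.rstrip("/") + GENID_PATH + internal_id


def deskolemize(base: str, iri: str):
    """Recover the internal ID, or return None if the IRI is not one of ours."""
    prefix = base.rstrip("/") + GENID_PATH
    if iri.startswith(prefix):
        return iri[len(prefix):]
    return None


iri = skolemize("http://example.org", "b42")
print(iri)
print(deskolemize("http://example.org", iri))
```

The mapping is stable for exactly as long as the internal ID is stable, which is an implementation detail of the store.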

>> In SPARQL Update, we now have persistent blank nodes. I believe that Graph Stores as defined in SPARQL Update do not meet the normative constraint above.
>> 
>> Thought experiment: I have a graph store. It lives on a disk somewhere. I make a copy of that disk, ship the copy around the world, and start it up. Now we have two graph stores with two different sets of endpoints. Do they still contain the same blank nodes or not?
>> 
> 
> Tricky question. Similarly, what if you ship the original disk? Or what if you just turn off the system and turn it back on?
> 
> I think we need to focus on observable system behaviors.

We also need to focus on the constraints that we put (or don't put) on existing and future specifications that use the “RDF dataset”, “graph store” and “g-box” concepts.

> In these cases, I don't think there's any way to ask a system if they are the same blank node, so it doesn't matter.

I've already said how this affects observable system behaviours.

> (If it's maintaining a stable Skolem mapping, then it would matter -- but then it's barely a blank node any more....)
> 
>> The normative sentence above means that the SPARQL Update spec (or RDF Concepts, if we put the definition there) needs to somehow give an answer to this question.
>> 
>> Does the answer matter? Yes, because we want to do things like federating multiple graph stores into one graph store, and I can ask SPARQL queries where it matters whether these blank nodes from different graph stores are considered the same or not. So to implement such a federation engine, we need an answer.
>> 
> 
> I don't think the existing SPARQL syntaxes/protocols provide any way to get at this distinction, and I think that's probably good.

That's not terribly relevant. We're defining an abstract syntax. Many query languages, dump formats and protocols are possible over that abstract syntax. Expecting the spec to answer how to merge two RDF datasets or two graph stores is certainly not unreasonable, as some future specs will probably need to merge datasets. And here, the question whether they can share blank nodes matters. It's the same as with RDF graphs, where we need to distinguish between “merge” and “union”, because of potential shared blank nodes.

Do you expect that we define “RDF dataset merge” and “RDF dataset union”?

How do you merge/union two graph stores?
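
For plain RDF graphs, the distinction can be sketched in Python as follows. This is an illustration under stated assumptions: graphs are modelled as sets of triples, literals as plain strings, and the example.org property IRIs are made up.

```python
class BlankNode:
    """A blank node, compared and hashed by object identity."""


def union(g1, g2):
    """Set union: a blank node occurring in both graphs stays a single node."""
    return g1 | g2


def merge(g1, g2):
    """Merge: blank nodes of g2 are renamed apart before taking the union."""
    renamed = {}

    def fresh(term):
        if isinstance(term, BlankNode):
            if term not in renamed:
                renamed[term] = BlankNode()
            return renamed[term]
        return term

    return g1 | {tuple(fresh(t) for t in triple) for triple in g2}


def bnodes(graph):
    return {t for triple in graph for t in triple if isinstance(t, BlankNode)}


b = BlankNode()
g1 = {(b, "http://example.org/name", "Alice")}
g2 = {(b, "http://example.org/age", "30")}

print(len(bnodes(union(g1, g2))), len(bnodes(merge(g1, g2))))
```

Which of the two operations is the right one for two datasets or two graph stores depends on whether they can share blank nodes in the first place, and that is the question the specs leave open.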

> To put it differently, SPARQL doesn't provide any way to move a blank node from one endpoint to a different one. They are opaque and trapped within processes.

There can be multiple endpoints over the same graph store. It will be very common to have various views onto the same graph store with different permissions and the like.

>> It appears to me that SPARQL Update does not give an answer.
>> 
>> My preferred approach to this issue would have been to adopt the axiom that blank nodes are scoped to a g-box, and hence different g-boxes contain different blank nodes; and then work out the consequences from that axiom.
> 
> How could blank nodes be "scoped" to g-boxes? You mean if the same blank node occurs in two g-boxes (like the same variable name occurring in two scopes in a program) it denotes something different?

No, I meant what I said: Different g-boxes contain different blank nodes. It is, by definition, not possible to have the same blank node in two g-boxes.

> That seems like a very bad idea.  

Certainly.

> Or do you just mean blank nodes are forbidden from occurring in multiple g-boxes?  

Yes.

> But that would break lots of deployed systems (eg 4-store, with its union-default graph).

How much it actually breaks depends on how many stores actually have managed to get the same blank node into multiple graphs. It's not that easy! And it might be possible to explain this issue away with skolem IRIs.
> 
>> SPARQL Update has already thrown a big wrench into the gears here by allowing blank nodes to be copied between graphs; but perhaps this problem could have still been explained away.
>> 
>> But allowing blank nodes to be shared between graphs in TriG and N-Quads would definitely kill that approach. This is why I have opposed this sharing of blank nodes in yesterday's call.
>> 
>> 
>> 
>> Now, another approach might be to adopt a different axiom:
>> 
>> [[
>> PROPOSAL: Two different graph stores can never share a blank node. Even if both graph stores are based on the same data (e.g., one is a copy or subset or view of the other), their blank nodes are, by definition, disjoint.
>> ]]
>> 
> 
> I like that idea, but I don't think there is even a crisp notion of "different graph stores", so that might not work.

Well, the definition has to be just crisp enough to make it unlikely that two reasonable individuals end up answering the question “are these two graph stores the same?” differently. That's not a very high bar.
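
One way to picture the proposal is a sketch in which a blank node's identity includes the graph store it lives in. The class names and store identifiers below are hypothetical, and the sketch simply assumes that "which graph store is this?" already has an agreed answer.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class ScopedBlankNode:
    store_id: str  # hypothetical identifier of the owning graph store
    local_id: int  # store-internal ID, meaningless outside that store


class GraphStore:
    def __init__(self, store_id: str):
        self.store_id = store_id
        self._ids = count()

    def fresh_blank_node(self) -> ScopedBlankNode:
        return ScopedBlankNode(self.store_id, next(self._ids))


ireland = GraphStore("store-ireland")
new_zealand = GraphStore("store-new-zealand")

n1 = ireland.fresh_blank_node()
n2 = new_zealand.fresh_blank_node()

# Same local ID, different stores: by definition not the same blank node.
print(n1.local_id == n2.local_id, n1 == n2)
```

With this model, a copy of a store that is started up as a separate store would get its own identifier, so its blank nodes are disjoint from the original's even though the underlying data is the same.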

>> This should answer the question of blank node scope in the following way:
>> 
>> 1. Within any concrete RDF document (TriG, Turtle, SPARQL results, etc.), blank nodes are scoped to that document, and the document syntax defines the rules that say whether two blank nodes are the same or not.
> 
> Sounds good, assuming you mean "blank node *labels* are scoped to that document".  

Each blank node label in a Turtle document represents a “fresh” blank node. This means that none of the represented blank nodes are shared with anything outside of the document. Hence the blank nodes are scoped to the document.

(The blank node *labels* are scoped to the document too.)

> If you want to conflate blank nodes and blank node labels, I want to see some proposed text changes for the Turtle document.

But that's what Turtle already says.

>> 2. Within any persistent graph store, blank nodes are scoped to the graph store.
> 
> Again, I don't have any idea what you mean by "scoped" here.

Whether the blank node can be shared with the rest of the world outside of the graph store or not.

Best,
Richard



>> 3. The abstract mathematical structures (RDF graphs, RDF datasets, SPARQL result sequences) are always either the result of parsing a concrete document, or are a static snapshot of a persistent graph store (or part thereof), and their scope is the document or persistent store.
>> 
> 
> That sounds okay.
> 
>     - s
> 
>> 
>> 
>> Thoughts?
>> 
>> Best,
>> Richard
>> 
>> 
>
Received on Thursday, 6 September 2012 19:14:13 UTC