Blank nodes and SQL sequences (was: Re: Why blank nodes?)

Steve,

On 7 Sep 2012, at 11:35, Steve Harris wrote:
> Taking a step back, and thinking about what we (Experian) actually use bNodes for, to inform our position on the various scope questions.
> 
> Basically, it's just a replacement for auto_increment columns in SQL.

I'd like to take this metaphor a bit further.

AUTO_INCREMENT is a MySQL-specific feature. It's a way of getting guaranteed unique identifiers within the scope of a MySQL table.

The standard SQL equivalent is the SEQUENCE. It's not bound to a specific table, but needs to be created explicitly:

  CREATE SEQUENCE customer_seq INCREMENT BY 1 START WITH 1;

Then when I want to insert a new row into my customer table, I can grab a “fresh” value from the sequence using the expression customer_seq.NEXTVAL:

  INSERT INTO customer (cust_id, name, address)
  VALUES (customer_seq.NEXTVAL, 'John Doe', '123 Main St.');

A SEQUENCE guarantees that successive calls to NEXTVAL will return different values. AUTO_INCREMENT is just like a SEQUENCE that's tightly bound to the table.


So this is a lot like blank nodes in RDF, if we ignore concrete syntaxes and the semantics, and just look at the data model. RDF Concepts says:

[[
The blank nodes in an RDF graph are drawn from an infinite set. This set is disjoint from the set of all IRIs and the set of all literals. Otherwise, this set of blank nodes is arbitrary. Given two blank nodes, it is possible to determine whether or not they are the same. Besides that, RDF makes no reference to any internal structure of blank nodes.
]]

So, when we talk about “allocating a fresh blank node”, we really pull a NEXTVAL from this infinite sequence of blank nodes.

The thing is, RDF Concepts doesn't say what the “scope” of this “sequence” is. The “sequence” is not bound to one “table” like in MySQL. It's not explicitly created and explicitly referenced like in vanilla SQL. It's all sort of left to implementations.

Jena, I believe, assumes “one big universal sequence of blank nodes in the sky”, and the uniqueness of values within the sequence is only stochastically guaranteed.

In other implementations, the “sequence” is managed by the RDF parser, and only guarantees uniqueness within the RDF graph generated from one document. This is okay, as long as everyone is really careful when combining the results of parsing multiple documents. This is why RDF Semantics distinguishes between “graph union” and “graph merge”: A graph union is safe when all the blank nodes came from the same sequence. If they came from different sequences, then both sequences may have produced the “same” blank node, and hence we need to do an RDF merge and “standardize the blank nodes apart” before we can safely combine the graphs.

The current specs are sort of ok in this regard as long as we only talk about RDF graphs, because they clearly point out the difference between merge and union.

Once we go to g-boxes, persistence, and data structures that contain multiple graphs, I feel that the specs don't say enough to explain how an implementation needs to manage blank nodes in order to ensure interoperability.

The proposal I made earlier is essentially, “a graph store comes with its own built-in sequence, and all its blank nodes come from that sequence, and hence graph stores don't share blank nodes.”

Another way to improve the situation would be to say more clearly and generically what RDF 2004 already says: “When we talk about ‘fresh’ blank nodes in any RDF-related spec, then these ‘fresh’ nodes always come from some sort of blank node sequence. What sequence that is—a single global one, or multiple local ones–is implementation-dependent. However, if you want to ever hold blank nodes that came from different sequences in a single RDF graph, RDF dataset, or graph store, then you first need to standardize the blank nodes apart, that is, replace those from sequence B with fresh ones from sequence A so that all the blank nodes in the graph/dataset/store come from the same sequence. This ensures that we can safely say whether any two given blank nodes in the graph/dataset/store are the same or not.”

I sort of like this, because it describes what is already implemented, while not constraining the implementations, and providing some useful explanation of why we sometimes need to “standardize blank nodes apart”.

Best,
Richard

Received on Friday, 7 September 2012 12:53:40 UTC