Re: Blank nodes and SQL sequences (was: Re: Why blank nodes?)

On 2012-09-07, at 13:53, Richard Cyganiak wrote:

> Steve,
> 
> On 7 Sep 2012, at 11:35, Steve Harris wrote:
>> Taking a step back, and thinking about what we (Experian) actually use bNodes for, to inform our position on the various scope questions.
>> 
>> Basically, it's just a replacement for auto_increment columns in SQL.
> 
> I'd like to take this metaphor a bit further.
> 
> AUTO_INCREMENT is a MySQL-specific feature. It's a way of getting guaranteed unique identifiers within the scope of a MySQL table.

Good point (it's also in SQLite, and I think Postgres, but yeah, it's not standard SQL). It's been so long since I used anything else that I'd forgotten that.

> The standard SQL equivalent is the SEQUENCE. It's not bound to a specific table, but needs to be created explicitly:
> 
>  CREATE SEQUENCE customer_seq INCREMENT BY 1 START WITH 1;
> 
> Then when I want to insert a new row into my customer table, I can grab a “fresh” value from the sequence using the expression customer_seq.NEXTVAL:
> 
>  INSERT INTO customer (cust_id, name, address)
>  VALUES (customer_seq.NEXTVAL, 'John Doe', '123 Main St.');
> 
> A SEQUENCE guarantees that successive calls to NEXTVAL will return different values. AUTO_INCREMENT is just like a SEQUENCE that's tightly bound to the table.
> 
> 
> So this is a lot like blank nodes in RDF, if we ignore concrete syntaxes and the semantics, and just look at the data model. RDF Concepts says:
> 
> [[
> The blank nodes in an RDF graph are drawn from an infinite set. This set is disjoint from the set of all IRIs and the set of all literals. Otherwise, this set of blank nodes is arbitrary. Given two blank nodes, it is possible to determine whether or not they are the same. Besides that, RDF makes no reference to any internal structure of blank nodes.
> ]]
> 
> So, when we talk about “allocating a fresh blank node”, we really pull a NEXTVAL from this infinite sequence of blank nodes.

And that's exactly what 4store and 5store do. There's a sequence for each DB/KB/store/dataset, or whatever the word du jour is.
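
To make the per-store sequence idea concrete, here is a minimal sketch in Python. The `Store` class and `fresh_bnode` method are hypothetical names used for illustration only; this is not how 4store or 5store are actually written, just the general shape of "one counter per store".

```python
import itertools


class Store:
    """Hypothetical store that owns exactly one blank node sequence."""

    def __init__(self, name):
        self.name = name
        self._seq = itertools.count(1)  # plays the role of a SQL SEQUENCE

    def fresh_bnode(self):
        # Roughly "pull NEXTVAL": never hands out the same value twice
        # within this store. Including the store name keeps nodes from
        # different stores distinct in this sketch.
        return (self.name, next(self._seq))


a = Store("a")
b = Store("b")
n1 = a.fresh_bnode()
n2 = a.fresh_bnode()
n3 = b.fresh_bnode()
```

Here `n1` and `n2` differ because they come from the same counter, and `n3` differs from both only because the sketch tags each node with its store. Without that tag, two stores would happily issue the "same" node.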

> The thing is, RDF Concepts doesn't say what the “scope” of this “sequence” is. The “sequence” is not bound to one “table” like in MySQL. It's not explicitly created and explicitly referenced like in vanilla SQL. It's all sort of left to implementations.

Right, we're really delving more into implementation details here.

> Jena, I believe, assumes “one big universal sequence of blank nodes in the sky”, and the uniqueness of values within the sequence is only stochastically guaranteed.
> 
> In other implementations, the “sequence” is managed by the RDF parser, and only guarantees uniqueness within the RDF graph generated from one document. This is okay, as long as everyone is really careful when combining the results of parsing multiple documents. This is why RDF Semantics distinguishes between “graph union” and “graph merge”: A graph union is safe when all the blank nodes came from the same sequence. If they came from different sequences, then both sequences may have produced the “same” blank node, and hence we need to do an RDF merge and “standardize the blank nodes apart” before we can safely combine the graphs.
> 
> The current specs are sort of ok in this regard as long as we only talk about RDF graphs, because they clearly point out the difference between merge and union.
> 
> Once we go to g-boxes, persistence, and data structures that contain multiple graphs, I feel that the specs don't say enough to explain how an implementation needs to manage blank nodes in order to ensure interoperability.

Agreed.
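
For the union versus merge distinction quoted above, a rough Python sketch may help. It assumes graphs are sets of 3-tuples of strings and that blank nodes are the strings starting with `_:`; the `merge` and `is_bnode` helpers are made up for this example and do not come from any particular RDF library.

```python
import itertools


def is_bnode(term):
    # Assumption for this sketch: blank nodes are labels starting with "_:"
    return term.startswith("_:")


def merge(g1, g2):
    """Combine two graphs, standardizing g2's blank nodes apart from g1's."""
    used = {t for triple in g1 for t in triple if is_bnode(t)}
    counter = itertools.count(1)
    renamed = {}

    def fresh():
        # Keep drawing labels until one is not already used on the g1 side
        while True:
            label = f"_:m{next(counter)}"
            if label not in used:
                used.add(label)
                return label

    def rename(term):
        if not is_bnode(term):
            return term
        if term not in renamed:
            renamed[term] = fresh()
        return renamed[term]

    return set(g1) | {tuple(rename(t) for t in triple) for triple in g2}


# Two parsers, each with its own sequence, both produced "_:b1"
g1 = {("_:b1", "ex:name", '"Alice"')}
g2 = {("_:b1", "ex:name", '"Bob"')}

union = g1 | g2          # conflates the two nodes: one subject, two names
merged = merge(g1, g2)   # keeps them apart: two subjects, one name each
```

The plain union is only safe if both graphs drew their blank nodes from the same sequence. The merge renames every blank node from the second graph, so it is safe either way.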

> The proposal I made earlier is essentially, “a graph store comes with its own built-in sequence, and all its blank nodes come from that sequence, and hence graph stores don't share blank nodes.”

Which is what 4/5store do.

But note that this makes it tricky to guarantee that bNodes can't appear in more than one graph, as there's nothing to tie an internal sequence ID to a particular graph. As far as the engine's concerned, it's just an ID that's never going to be reissued by the parser.

In particular, the way 4/5store define the default graph out of the box (as the union of the named graphs) means that every bNode appears in at least two graphs.
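
A small Python sketch of why that follows, assuming a toy store where the default graph is computed as the union of the named graphs. The graph names, the tuple encoding of bNodes, and the `graphs_containing` helper are all invented for illustration.

```python
# Blank nodes are ("bnode", n) pairs drawn from one store-wide sequence
named_graphs = {
    "ex:g1": {(("bnode", 1), "ex:knows", ("bnode", 2))},
    "ex:g2": {(("bnode", 3), "ex:name", '"Carol"')},
}

# Default graph defined as the union of all named graphs
default_graph = set().union(*named_graphs.values())


def graphs_containing(node):
    """List every graph (named or default) in which the node occurs."""
    found = [
        name
        for name, graph in named_graphs.items()
        if any(node in triple for triple in graph)
    ]
    if any(node in triple for triple in default_graph):
        found.append("DEFAULT")
    return found


where = graphs_containing(("bnode", 1))
```

Every bNode that occurs in some named graph also occurs in the default graph, so a rule like "a bNode belongs to exactly one graph" cannot hold in this setup.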

> Another way to improve the situation would be to say more clearly and generically what RDF 2004 already says: “When we talk about ‘fresh’ blank nodes in any RDF-related spec, then these ‘fresh’ nodes always come from some sort of blank node sequence. What sequence that is (a single global one, or multiple local ones) is implementation-dependent. However, if you want to ever hold blank nodes that came from different sequences in a single RDF graph, RDF dataset, or graph store, then you first need to standardize the blank nodes apart, that is, replace those from sequence B with fresh ones from sequence A so that all the blank nodes in the graph/dataset/store come from the same sequence. This ensures that we can safely say whether any two given blank nodes in the graph/dataset/store are the same or not.”

That feels to me a bit too much like specifying implementation details. I can imagine other schemes that are better for some specific purpose. 5store, for example, issues its IDs in a /very/ specific order for performance and parallelism reasons (I'll write a paper about it someday), and I can imagine people wanting to use other schemes for similar esoteric reasons.

> I sort of like this, because it describes what is already implemented, while not constraining the implementations, and providing some useful explanation of why we sometimes need to “standardize blank nodes apart”.

Well, my impression is that Jena uses something closer to a UUID.
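
For contrast with a counter, here is a sketch of what a UUID-style scheme could look like in Python. This only illustrates the general idea; it is not Jena's code, and the `fresh_bnode` name and label format are assumptions made for the example.

```python
import uuid


def fresh_bnode():
    # uuid4 carries 122 random bits, so uniqueness is probabilistic
    # rather than guaranteed by a shared counter
    return f"_:b{uuid.uuid4().hex}"


x = fresh_bnode()
y = fresh_bnode()
```

No coordination between allocators is needed, which is what makes a single "universal" sequence workable, at the cost of longer identifiers and a guarantee that is only statistical.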

- Steve

-- 
Steve Harris, CTO
Garlik, a part of Experian
+44 7854 417 874  http://www.garlik.com/
Registered in England and Wales 653331 VAT # 887 1335 93
80 Victoria Street, London, SW1E 5JL

Received on Friday, 7 September 2012 13:56:00 UTC