Re: Blank nodes and SQL sequences

On 7 Sep 2012, at 14:50, Sandro Hawke wrote:
>>> Taking a step back, and thinking about what we (Experian) actually use bNodes for, to inform our position on the various scope questions.
>>> 
>>> Basically, it's just a replacement for auto_increment columns in SQL.
>> 
>> I'd like to take this metaphor a bit further.
>> 
>> AUTO_INCREMENT is a MySQL-specific feature. It's a way of getting guaranteed unique identifiers within the scope of a MySQL table.
>> 
>> The standard SQL equivalent is the SEQUENCE. It's not bound to a specific table, but needs to be created explicitly:
>> 
>>   CREATE SEQUENCE customer_seq INCREMENT BY 1 START WITH 1;
>> 
>> Then when I want to insert a new row into my customer table, I can grab a “fresh” value from the sequence using the expression customer_seq.NEXTVAL (this is Oracle's spelling; standard SQL writes NEXT VALUE FOR customer_seq):
>> 
>>   INSERT INTO customer (cust_id, name, address)
>>   VALUES (customer_seq.NEXTVAL, 'John Doe', '123 Main St.');
>> 
>> A SEQUENCE guarantees that successive calls to NEXTVAL will return different values. AUTO_INCREMENT is just like a SEQUENCE that's tightly bound to the table.
>> 
>> 
>> So this is a lot like blank nodes in RDF, if we ignore concrete syntaxes and the semantics, and just look at the data model. RDF Concepts says:
>> 
>> [[
>> The blank nodes in an RDF graph are drawn from an infinite set. This set is disjoint from the set of all IRIs and the set of all literals. Otherwise, this set of blank nodes is arbitrary. Given two blank nodes, it is possible to determine whether or not they are the same. Besides that, RDF makes no reference to any internal structure of blank nodes.
>> ]]
>> 
>> So, when we talk about “allocating a fresh blank node”, we really pull a NEXTVAL from this infinite sequence of blank nodes.
>> 
>> The thing is, RDF Concepts doesn't say what the “scope” of this “sequence” is. The “sequence” is not bound to one “table” like in MySQL. It's not explicitly created and explicitly referenced like in vanilla SQL. It's all sort of left to implementations.
>> 
>> Jena, I believe, assumes “one big universal sequence of blank nodes in the sky”, and the uniqueness of values within the sequence is only stochastically guaranteed.
>> 
>> In other implementations, the “sequence” is managed by the RDF parser, and only guarantees uniqueness within the RDF graph generated from one document. This is okay, as long as everyone is really careful when combining the results of parsing multiple documents. This is why RDF Semantics distinguishes between “graph union” and “graph merge”: A graph union is safe when all the blank nodes came from the same sequence. If they came from different sequences, then both sequences may have produced the “same” blank node, and hence we need to do an RDF merge and “standardize the blank nodes apart” before we can safely combine the graphs.
>> 
>> The current specs are sort of ok in this regard as long as we only talk about RDF graphs, because they clearly point out the difference between merge and union.
>> 
>> Once we go to g-boxes, persistence, and data structures that contain multiple graphs, I feel that the specs don't say enough to explain how an implementation needs to manage blank nodes in order to ensure interoperability.
>> 
>> The proposal I made earlier is essentially, “a graph store comes with its own built-in sequence, and all its blank nodes come from that sequence, and hence graph stores don't share blank nodes.”
>> 
>> Another way to improve the situation would be to say more clearly and generically what RDF 2004 already says: “When we talk about ‘fresh’ blank nodes in any RDF-related spec, then these ‘fresh’ nodes always come from some sort of blank node sequence. What sequence that is (a single global one, or multiple local ones) is implementation-dependent. However, if you want to ever hold blank nodes that came from different sequences in a single RDF graph, RDF dataset, or graph store, then you first need to standardize the blank nodes apart, that is, replace those from sequence B with fresh ones from sequence A so that all the blank nodes in the graph/dataset/store come from the same sequence. This ensures that we can safely say whether any two given blank nodes in the graph/dataset/store are the same or not.”
>> 
>> I sort of like this, because it describes what is already implemented, while not constraining the implementations, and providing some useful explanation of why we sometimes need to “standardize blank nodes apart”.
> 
> Yeah, I think that works.
> 
> When you say, "if you want to ever hold blank nodes that came from different sequences in a single RDF graph, RDF dataset, or graph store, then you first need to standardize the blank nodes apart" -- I think that's right. Systems that use the UUID algorithm for generating fresh blank node IDs are all using the same "sequence", so they can skip the standardize-apart step.

Yes.
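
For the other case, where two parsers each run their own local sequence, here is a rough Python sketch of the standardize-apart step. The names make_sequence, standardize_apart and merge are made up for illustration, and blank nodes are modelled as tagged integers; this is an assumption for the sake of the example, not how any particular RDF library works.

```python
from itertools import count

def make_sequence():
    """Return a function that hands out a fresh blank node on each call.

    Blank nodes are modelled as ("bnode", n) tuples (an assumption for this sketch).
    """
    counter = count(1)
    return lambda: ("bnode", next(counter))

def is_bnode(term):
    return isinstance(term, tuple) and term[0] == "bnode"

def standardize_apart(graph, fresh):
    """Replace every blank node in `graph` with a fresh one drawn from `fresh`."""
    mapping = {}

    def rename(term):
        if not is_bnode(term):
            return term
        if term not in mapping:
            mapping[term] = fresh()
        return mapping[term]

    return {(rename(s), p, rename(o)) for (s, p, o) in graph}

def merge(graph_a, fresh_a, graph_b):
    """RDF-merge-like combination: re-draw graph_b's blank nodes from A's sequence."""
    return graph_a | standardize_apart(graph_b, fresh_a)

# Two parsers, each with its own local sequence.
fresh_a = make_sequence()
fresh_b = make_sequence()
a1 = fresh_a()
b1 = fresh_b()
graph_a = {(a1, "name", "Alice")}
graph_b = {(b1, "name", "Bob")}

# Both sequences produced the "same" blank node, so a plain union conflates them.
naive_union = graph_a | graph_b
merged = merge(graph_a, fresh_a, graph_b)
```

In naive_union the two triples share one subject, because both sequences started at the same value; in merged, the node from sequence B has been replaced by the next value from sequence A, so the two subjects stay distinct.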

> I'm not quite sure about the terminology -- the "sequence" is more of a namespace than a sequence,

Right now, I like the term “sequence” because of the analogy to SQL, and because it helps explain what we mean by “fresh” blank nodes (the next one from the sequence).

“Namespace” is confusing because RDF Concepts insists that blank nodes don't necessarily have names.

> and are they really blank nodes that are coming out of that sequence, or are they blank node internal identifiers?    

RDF Concepts again:

[[
The blank nodes in an RDF graph are drawn from an infinite set. This set is disjoint from the set of all IRIs and the set of all literals. Otherwise, this set of blank nodes is arbitrary.
]]

Let's say my internal blank node identifiers are UUIDs. UUIDs *are* disjoint from IRIs (different syntax) and *are* disjoint from literals (they are not <lexical form, datatype IRI> pairs). So, I can actually say that the UUIDs *are* my blank nodes. Not just an internal identifier for a node, but they actually *are* the nodes in the RDF graph. The thing is, RDF Concepts doesn't care whether they are UUIDs, natural numbers, Java objects, or anything else. All that matters is that we can tell one member of the set from the others. “Otherwise, this set of blank nodes is arbitrary.”
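
To make that concrete, a minimal Python sketch in which the UUIDs themselves serve as the blank nodes. The IRI and Literal classes and the fresh_blank_node helper are hypothetical and exist only for this illustration; they are not taken from any RDF library.

```python
import uuid
from dataclasses import dataclass

@dataclass(frozen=True)
class IRI:
    value: str

@dataclass(frozen=True)
class Literal:
    lexical_form: str
    datatype: IRI

def fresh_blank_node():
    """Draw a new blank node; here the random UUID object *is* the node."""
    return uuid.uuid4()

def is_blank_node(term):
    # Disjointness from IRIs and literals falls out of the types being different.
    return isinstance(term, uuid.UUID)

x = fresh_blank_node()
y = fresh_blank_node()

# The only operation the data model needs: are two blank nodes the same?
same = (x == x)
different = (x != y)  # holds with overwhelming probability, not by construction
```

Swapping uuid.uuid4() for a counter or a plain object() would leave the rest unchanged, which is the point: only the ability to tell two nodes apart matters.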

> It would be kind of nice to call them "private genids", to parallel public genids and IRIs, but I guess that's too far from tradition.

Well, if you want parallel terms then the less invasive thing would be to rename /.well-known/genid/ to /.well-known/bnode/, and to talk about “private blank nodes” and “public blank nodes”. Not that this is a particularly good idea IMO.

I agree that if we make some changes to RDF Concepts along the lines I've suggested in this thread, then some wording in the Skolem IRI section can probably be improved.

> But whatever, I can live with the terminology you're using above.

I'll draft some proposal text in the next couple of days.

Best,
Richard


> 
>    -- Sandro
> 
> 

Received on Saturday, 8 September 2012 19:34:01 UTC