Re: Blank nodes and SQL sequences (was: Re: Why blank nodes?) from Richard Cyganiak on 2012-09-07 (public-rdf-wg@w3.org from September 2012)

From: Richard Cyganiak <richard@cyganiak.de>
Date: Fri, 7 Sep 2012 16:11:38 +0100
To: Steve Harris <steve.harris@garlik.com>
Cc: "public-rdf-wg@w3.org WG" <public-rdf-wg@w3.org>
Message-Id: <FF3E2313-6FA8-4AB2-887D-05D73D6D3F1B@cyganiak.de>
On 7 Sep 2012, at 14:55, Steve Harris wrote:
>> The proposal I made earlier is essentially, “a graph store comes with its own built-in sequence, and all its blank nodes come from that sequence, and hence graph stores don't share blank nodes.”
> 
> Which is what 4/5store do.
> 
> But, note that this makes it tricky to guarantee that bNodes can't appear in more than one graph, as there's nothing to tie an internal sequence ID to a particular graph. As far as the engine's concerned, it's just an ID that's never going to be reissued by the parser.
> 
> In particular, the way 4/5store define the default graph (union of named graphs) out of the box, means that every bNode appears in at least 2 graphs.

I'm not fundamentally opposed to sharing blank nodes between graphs. My problem is that RDF Concepts says: “Given two blank nodes, it is possible to determine whether or not they are the same”, without any further explanation. In a world where blank nodes can be persisted and passed around between graphs, I feel that this simple sentence is no longer sufficient to explain how blank node identity ought to be managed. If we can clarify what that sentence means, and clarify what implementations and other specifications (like R2RML and SPARQL Update) are expected to do in order to not fall afoul of that constraint, while allowing blank nodes to be shared between graphs, then I'm perfectly okay with allowing them to be shared.

>> Another way to improve the situation would be to say more clearly and generically what RDF 2004 already says: “When we talk about ‘fresh’ blank nodes in any RDF-related spec, then these ‘fresh’ nodes always come from some sort of blank node sequence. What sequence that is (a single global one, or multiple local ones) is implementation-dependent. However, if you want to ever hold blank nodes that came from different sequences in a single RDF graph, RDF dataset, or graph store, then you first need to standardize the blank nodes apart, that is, replace those from sequence B with fresh ones from sequence A so that all the blank nodes in the graph/dataset/store come from the same sequence. This ensures that we can safely say whether any two given blank nodes in the graph/dataset/store are the same or not.”
> 
> That feels to me a bit too much like specifying implementation details. I can imagine other schemes that are better for some specific purpose. 5store, for example, issues its IDs in a /very/ specific order for performance and parallelism reasons (I'll write a paper about it someday), and I can imagine people wanting to use other schemes for similar esoteric reasons.

When I was talking about sequences I didn't necessarily mean it had to be 1, 2, 3. I meant that we have something that dispenses blank nodes (or internal identifiers for blank nodes, which is the same) in some order that guarantees that the same one never comes out a second time. A UUID generator makes a fine sequence, as long as the probability of clashes is sufficiently low for the user's application scenario.
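
To make the notion of a "sequence" concrete, here is a minimal Python sketch. It is an illustration only: the class names CounterSequence and UuidSequence are hypothetical and not taken from 4store, 5store, Jena, or any spec. Both dispense identifiers such that the same one never comes out twice; the counter guarantees that within one instance, the UUID variant only with high probability.

```python
import itertools
import uuid


class CounterSequence:
    """Hypothetical local sequence: dispenses 1, 2, 3, ... and never repeats
    within this instance. Two separate instances WILL hand out equal numbers."""

    def __init__(self):
        self._counter = itertools.count(1)

    def fresh(self):
        return next(self._counter)


class UuidSequence:
    """Hypothetical global sequence: random UUIDs. Uniqueness is probabilistic,
    not guaranteed, but clashes are very unlikely even across instances."""

    def fresh(self):
        return uuid.uuid4().hex


# Two independent counter sequences clash on their first identifier,
# which is why blank nodes from different local sequences cannot be compared.
a, b = CounterSequence(), CounterSequence()
first_a, first_b = a.fresh(), b.fresh()
```

Whether the identifiers are dense integers, UUIDs, or issued in some special order for performance is irrelevant to the argument; all that matters is that one sequence never dispenses the same identifier a second time.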

My point in the paragraph above was about explaining formally when one needs to standardize blank nodes apart, instead of simply forming set unions in the data structure. You need to standardize apart whenever the blank nodes involved could have come from different sequences. Again, whether there's one global sequence or multiple local ones is an implementation decision. In the case of 4store, since you have one sequence per graph store, you only need to standardize blank nodes apart if you want to import data from another graph store; in all other cases, the simple union is fine.
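
The difference between a simple union and standardizing apart can be sketched as follows. This is a toy model under stated assumptions, not how 4store or any other store is implemented: BNode, Store, add_local, and import_foreign are hypothetical names, triples are plain Python tuples, and each store owns one counter-based sequence.

```python
from dataclasses import dataclass
import itertools


@dataclass(frozen=True)
class BNode:
    """Hypothetical blank node: just an internal identifier from some sequence."""
    label: int


class Store:
    """Hypothetical graph store with exactly one blank node sequence."""

    def __init__(self):
        self._seq = itertools.count(1)
        self.triples = set()

    def fresh(self):
        return BNode(next(self._seq))

    def add_local(self, triples):
        # Blank nodes came from this store's own sequence: plain set union.
        self.triples |= set(triples)

    def import_foreign(self, triples):
        # Blank nodes may come from another sequence: standardize apart,
        # i.e. replace each foreign blank node with a fresh local one,
        # using the same replacement for every occurrence in this import.
        mapping = {}

        def rename(term):
            if isinstance(term, BNode):
                if term not in mapping:
                    mapping[term] = self.fresh()
                return mapping[term]
            return term

        self.triples |= {tuple(rename(t) for t in triple) for triple in triples}


store_a, store_b = Store(), Store()
alice = store_a.fresh()
store_a.add_local({(alice, "ex:name", "Alice")})

bob = store_b.fresh()  # same label as alice, but from a different sequence
store_a.import_foreign({(bob, "ex:name", "Bob"), (bob, "ex:age", "42")})
```

Without the renaming step, the union would silently merge the "Alice" node and the "Bob" node because both carry the label 1. With it, store_a ends up holding two distinct blank nodes, and the two imported triples still share one subject.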

An implementation that uses UUIDs, which are (with a certain probability) globally unique, never needs to standardize blank nodes apart.

(I believe in Jena, API users can assign their own non-UUID labels, so the situation might be more complicated there.)

Best,
Richard


>> I sort of like this, because it describes what is already implemented, while not constraining the implementations, and providing some useful explanation of why we sometimes need to “standardize blank nodes apart”.
> 
> Well, my impression is that Jena uses something closer to a UUID.
> 
> - Steve
> 
> -- 
> Steve Harris, CTO
> Garlik, a part of Experian
> +44 7854 417 874  http://www.garlik.com/
> Registered in England and Wales 653331 VAT # 887 1335 93
> 80 Victoria Street, London, SW1E 5JL
> 
>
Received on Friday, 7 September 2012 15:12:27 UTC