Re: [Graphs] BNode scope in RDF Datasets proposal from Steve Harris on 2011-03-09 (public-rdf-wg@w3.org from March 2011)

From: Steve Harris <steve.harris@garlik.com>
Date: Wed, 9 Mar 2011 09:58:00 +0000
To: Richard Cyganiak <richard@cyganiak.de>
Cc: Andy Seaborne <andy.seaborne@epimorphics.com>, public-rdf-wg@w3.org
Message-Id: <ADA55FAB-E02E-40EF-93E1-35D3CABF5886@garlik.com>
On 2011-03-09, at 08:02, Richard Cyganiak wrote:

> On 8 Mar 2011, at 18:43, Andy Seaborne wrote:
>> """
>> The same blank node cannot occur in two graphs at the same time.
>> """
>> 
>> If there is knowledge it's the same blank node why not allow it to be the same?  As long as the nodes aren't accidentally equated.
> 
> I understand where you're coming from but am unconvinced.
> 
> The reason for wanting blank node scoped to the graph is that they need to be scoped somehow, and the obvious alternative (scoping them to the dataset) just pushes the problem slightly further out -- you'd run into the same questions again in case you wanted to have multiple overlapping datasets that contain the same graphs (e.g., a public endpoint and an access-protected one). Scoping blank nodes to the graph has the nice property of making it possible to move graphs around between stores without anything surprising happening.
> 
> I'm uncomfortable with the notion of some not further specified “knowledge” that the same blank node occurs in multiple places.
> 
> How does this “cross-graph blank node knowledge” fit with SPARQL 1.0 and SPARQL 1.1? Can I somehow construct graphs with such overlap given just the graph management features and update features found in those specs?

The place I've run into this issue is SPARQL Update:

INSERT {
  GRAPH <G2> { ?x :p ?z }
}
WHERE {
  GRAPH <G1> { ?x :p ?z }
}

So the question is: should the bNodes in G2 be the "same" as the ones in G1? [ FWIW, currently in 4store we mint new bNodes on inserting them into G2, but with the same graph shape, i.e. a consistent bNode -> bNode mapping for those graphs. This may or may not be what's desired by the user. ]

However, if you later do the same update, but with :q instead of :p, it gets a bit murkier. 4store has forgotten the mapping from bNodes in G1 to G2 by then, but the user might reasonably expect the :q part of the graph structure, relating to the same bNodes, to be copied. At that point it's impossible, unless you maintain the mapping forever, or reuse the same (skolem form for the) bNodes.
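
To make the two-update case concrete, here is a toy Python sketch. It is not 4store's code: the BNode class, the copy_predicate helper and the dict-of-sets store are all made up for illustration. It only assumes the behaviour described above, namely that each update mints fresh bNodes with a mapping that is consistent within that update. Running the :p update and then the :q update with separate mappings leaves G2 with two unrelated bNodes, whereas keeping one mapping across both updates preserves the shape.

```python
import itertools

_ids = itertools.count()


class BNode:
    """Hypothetical blank node: it has identity only, the label is for printing."""

    def __init__(self):
        self.label = f"b{next(_ids)}"

    def __repr__(self):
        return f"_:{self.label}"


def copy_predicate(store, src, dst, pred, mapping=None):
    """Copy every (s, pred, o) triple from graph src to graph dst.

    bNodes are replaced by freshly minted ones. With mapping=None the
    bNode -> bNode mapping lives only for this call (the "forgetful" case);
    passing a dict keeps the mapping alive across calls.
    """
    if mapping is None:
        mapping = {}

    def rename(term):
        if isinstance(term, BNode):
            if term not in mapping:
                mapping[term] = BNode()
            return mapping[term]
        return term

    for s, p, o in list(store.get(src, set())):
        if p == pred:
            store.setdefault(dst, set()).add((rename(s), p, rename(o)))


x = BNode()
g1 = {(x, ":p", "1"), (x, ":q", "2")}

# Two separate updates, each with its own short-lived mapping.
forgetful = {"G1": set(g1)}
copy_predicate(forgetful, "G1", "G2", ":p")
copy_predicate(forgetful, "G1", "G2", ":q")
forgetful_subjects = {s for s, _, _ in forgetful["G2"]}

# The same two updates, sharing one mapping that is never forgotten.
remembered = {"G1": set(g1)}
mapping = {}
copy_predicate(remembered, "G1", "G2", ":p", mapping)
copy_predicate(remembered, "G1", "G2", ":q", mapping)
remembered_subjects = {s for s, _, _ in remembered["G2"]}

print(len(forgetful_subjects))   # 2: :p and :q hang off different bNodes
print(len(remembered_subjects))  # 1: both triples share one new bNode
```

Reusing the original bNode (or a skolem form of it) would correspond to rename returning term unchanged, which avoids storing any mapping at all.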

> Are there implementations that allow blank nodes to occur in multiple graphs, and if so, then how does the knowledge get into the store?

Well, the store is indexed, so you can just ask:

SELECT ?bnode
WHERE {
  GRAPH ?g1 { ?bnode ?p ?o }
  GRAPH ?g2 { ?bnode ?p ?o }
  FILTER(ISBLANK(?bnode))
  FILTER(?g1 != ?g2)
}

etc.

>> As in the default-graph-as-union and the base+inference cases, there are uses for the subgraph relationship and then it is the same blank node.
> 
> I don't see how it's relevant to default-graph-as-union, you can have that no matter how you scope the blank nodes. But it's true that we have use cases that perhaps require it *if* blank nodes occur in certain places in the data: “Slicing datasets according to multiple dimensions” and “Tracing inference results.”

With default-graph-as-union the same bNode will appear in a named graph and in the default graph, i.e.

SELECT ?bnode
WHERE {
  ?bnode ?p ?o .
  GRAPH ?g {
    ?bnode ?p ?o
  }
  FILTER(ISBLANK(?bnode))
}

Will return every bNode that appears as a subject.
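
A rough way to see why, as a Python sketch: it assumes a store where the default graph is computed as the union of the named graphs and where bNode identifiers are store-wide. The function names and the "_:" string convention for bNodes are illustrative only, not any particular implementation.

```python
def default_graph_as_union(named_graphs):
    """Return the default graph as the union of all named graphs' triples."""
    union = set()
    for triples in named_graphs.values():
        union |= triples
    return union


def is_blank(term):
    # Assumption for this sketch: bNodes are strings starting with "_:".
    return term.startswith("_:")


def shared_blank_subjects(named_graphs):
    """Mimic the query above: a bNode subject whose triple matches in both
    the default graph and some named graph."""
    default = default_graph_as_union(named_graphs)
    return {
        s
        for triples in named_graphs.values()
        for (s, p, o) in triples
        if is_blank(s) and (s, p, o) in default
    }


named = {
    "G1": {("_:a", ":p", "1"), ("<http://example.org/s>", ":p", "2")},
    "G2": {("_:b", ":q", "3")},
}
result = shared_blank_subjects(named)
print(sorted(result))  # ['_:a', '_:b']
```

Because every named-graph triple is also in the union, the membership test never filters anything out, so every bNode subject comes back.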

>> For TriG and N-quads, I suggest blank node labels are scoped to the document, and across graphs.  It's confusing to see two _:a to mean different things without much stronger scoping intuitions (esp. N-Quads); it makes it possible to record when you do know they are the same bnode (one graph a subgraph of another).
> 
> Especially for N-Quads I would argue against this. We've found the ability to sensibly “merge” N-Quads files just by concatenating them, as well as other ad hoc string/line based operations, quite handy. If _:a in two different graphs means the same thing, then that's no longer possible, and we'd have to do the “standardize apart” dance. Also we use N-Quads a lot for storing results of web crawls, where the notion of a blank node shared between graphs is counter-intuitive, and where ensuring uniqueness of blank node labels across hundreds of millions of graphs would be expensive in various ways.
> 
> This whole discussion just shows again what a bloody pain blank nodes are. I guess my position is that “blank nodes should have less magic”, and blank nodes shared between graphs in a dataset under certain circumstances just adds more magic that will trip people over and cause headaches for future users and implementers (and spec writers). When you're down a hole, the first thing to do is stop digging.

I both agree, and disagree :)

Being able to concatenate N-Quads is indeed useful, and I probably wouldn't want to lose that, but I think bNode scope is one of the things that trips up users. My impression is that people who're used to AUTO_INCREMENT columns in RDBMSs expect bNodes to behave more like those: values that can be moved between tables and databases, as they're just integers, with no additional semantics.

I'm probably used to a different kind of user though.

- Steve

-- 
Steve Harris, CTO, Garlik Limited
1-3 Halford Road, Richmond, TW10 6AW, UK
+44 20 8439 8203  http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
Received on Wednesday, 9 March 2011 09:58:36 UTC