
Re: [Graphs] BNode scope in RDF Datasets proposal

From: Andy Seaborne <andy.seaborne@epimorphics.com>
Date: Wed, 09 Mar 2011 09:07:17 +0000
Message-ID: <4D774345.2010601@epimorphics.com>
To: Richard Cyganiak <richard@cyganiak.de>
CC: public-rdf-wg@w3.org


On 09/03/11 08:02, Richard Cyganiak wrote:
> On 8 Mar 2011, at 18:43, Andy Seaborne wrote:
>> """ The same blank node cannot occur in two graphs at the same
>> time. """
>>
>> If there is knowledge it's the same blank node why not allow it to
>> be the same?  As long as the nodes aren't accidentally equated.
>
> I understand where you're coming from but am unconvinced.
>
> The reason for wanting blank node scoped to the graph is that they
> need to be scoped somehow, and the obvious alternative (scoping them
> to the dataset) just pushes the problem slightly further out -- you'd
> run into the same questions again in case you wanted to have multiple
> overlapping datasets that contain the same graphs (e.g., a public
> endpoint and an access-protected one). Scoping blank nodes to the
> graph has the nice property of making it possible to move graphs
> around between stores without anything surprising happening.
>
> I'm uncomfortable with the notion of some not further specified
> “knowledge” that the same blank node occurs in multiple places.
>
> How does this “cross-graph blank node knowledge” fit with SPARQL 1.0
> and SPARQL 1.1? Can I somehow construct graphs with such overlap
> given just the graph management features and update features found in
> those specs?
>
> Are there implementations that allow blank nodes to occur in multiple
> graphs, and if so, then how does the knowledge get into the store?

Yes.

>> As in the default-graph-as-union and the base+inference cases,
>> there are uses for the subgraph relationship and then it is the
>> same blank node.
>
> I don't see how it's relevant to default-graph-as-union, you can have
> that no matter how you scope the blank nodes. But it's true that we
> have use cases that perhaps require it *if* blank nodes occur in
> certain places in the data: “Slicing datasets according to multiple
> dimensions” and “Tracing inference results.”

If it's a union of graphs (not an RDF merge), then it's the same blank 
node.  That's what set union gives you, and it's the effect of ignoring 
the 4th column in a quad store (you have to ensure distinctness).
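A minimal sketch of the distinction, in plain Python with hypothetical triples (not any particular RDF library): set union keeps a shared blank node as one node, whereas an RDF merge must first standardize the blank nodes apart.

```python
# Graphs modelled as sets of (s, p, o) triples; "_:a" is a blank node label.
g1 = {("_:a", "ex:name", '"Alice"')}
g2 = {("_:a", "ex:age", '"30"')}

# Union: the same label stays the same node -- two triples about one node.
# This is what dropping the 4th column of a quad store gives you.
union = g1 | g2

def standardize_apart(graph, tag):
    """Rename blank nodes with a graph-specific prefix before combining."""
    def rename(term):
        return term.replace("_:", f"_:g{tag}_") if term.startswith("_:") else term
    return {(rename(s), p, rename(o)) for (s, p, o) in graph}

# Merge: the two occurrences of _:a become distinct nodes.
merge = standardize_apart(g1, 1) | standardize_apart(g2, 2)

print(len({s for (s, _, _) in union}))   # 1 -- one subject node
print(len({s for (s, _, _) in merge}))   # 2 -- two subject nodes
```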

>
>> For TriG and N-quads, I suggest blank node labels are scoped to the
>> document, and across graphs.  It's confusing to see two _:a to mean
>> different things without much stronger scoping intuitions (esp.
>> N-Quads); it makes it possible to record when you do know they are
>> the same bnode (one graph a subgraph of another).
>
> Especially for N-Quads I would argue against this. We've found the
> ability to sensibly “merge” N-Quads files just by concatenating them,
> as well as other ad hoc string/line based operations, quite handy. If
> _:a in two different graphs means the same thing, then that's no
> longer possible, and we'd have to do the “standardize apart” dance.

You can't concatenate even N-Triples files if you want to merge graphs. 
Concatenation is union, not merge.

> Also we use N-Quads a lot for storing results of web crawls, where
> the notion of a blank node shared between graphs is
> counter-intuitive, and where ensuring uniqueness of blank node labels
> across hundreds of millions of graphs would be expensive in various
> ways.

There are schemes like UUIDs that provide uniqueness without a central 
authority.  A UUID is 16 bytes (128 bits).  The chances of even V4 UUIDs 
clashing (they are 122-bit random numbers) are so remote that you should 
worry more about disasters hitting the data-centre and the backup at the 
same time.

If you have access to a MAC address, and non-Byzantine software, V1 is 
even more robust and cheap to allocate.
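A sketch of the V4 scheme, assuming a label shape of my own choosing (the "b" prefix keeps the label starting with a letter, as N-Triples syntax expects):

```python
import uuid

def fresh_bnode_label():
    # A V4 UUID carries 122 random bits, so writers on different machines
    # can allocate labels with no coordination and effectively no risk of
    # collision.  Hyphens are dropped to keep the label a plain name.
    return "_:b" + uuid.uuid4().hex

a = fresh_bnode_label()
b = fresh_bnode_label()
print(a != b)  # True -- independently allocated labels are distinct
```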

>
> This whole discussion just shows again what a bloody pain blank nodes
> are. I guess my position is that “blank nodes should have less
> magic”, and blank nodes shared between graphs in a dataset under
> certain circumstances just adds more magic that will trip people over
> and cause headaches for future users and implementers (and spec
> writers). When you're down a hole, the first thing to do is stop
> digging.

Respectfully, I disagree.  This is the simple route.  Once parsed and 
carefully kept apart, bNodes can be treated as things; later inference 
or applications can decide whether to smush, lean, or whatever.

	Andy

>
> Best, Richard
>
>
>>
>> Andy
>>
>
Received on Wednesday, 9 March 2011 09:07:55 UTC
