Re: Pragmatics of Blank Nodes Re: Toward easier RDF: a proposal from Aidan Hogan on 2018-12-12 (semantic-web@w3.org from December 2018)

From: Aidan Hogan <aidhog@gmail.com>
Date: Wed, 12 Dec 2018 17:56:13 -0300
To: David Booth <david@dbooth.org>, semantic-web <semantic-web@w3.org>
Cc: Henry Story <henry.story@bblfish.net>
Message-ID: <d9d4443c-5b23-b184-7dc7-250f8a358bc0@gmail.com>
On 04-12-2018 17:52, David Booth wrote:
> On 12/4/18 3:30 PM, David Booth wrote:
>> On 12/3/18 8:29 AM, Henry Story wrote:
>>  > . . .  So what are the advantages of blank nodes
>>  > pragmatically? They make a description local to the graph
>>  > in which they appear and this locality is maintained
>>  > across merges. The meaning of URI referenced resources can
>>  > be completed by external information of course but the
>>  > description ensures that no further links need to be taken
>>  > into account when understanding the bnode's meaning. So it
>>  > looks like it's ideal for things that need to be entirely
>>  > defined by description.
>>
>> Interesting point!   That means that blank nodes enjoy a
>> form of closed world assumption (CWA), in that there *cannot*
>> be any other triples asserted (directly) about a blank node,
>> other than the ones already in the document/graph/dataset
>> at hand.  (Inference could add some though.)

On the "form of CWA" point: I get where you're coming from, but I think
blank node scoping is orthogonal to the CWA.

CWA says that data that are not given (triples in our case) are assumed
to be false. The local scoping of blank nodes only means that we cannot
make further statements about that particular blank node from outside
the document.

Take for example a document that says:

   [] foaf:name "John Smith" .

This means (according to simple semantics at least) that there exists 
something called John Smith.

It does not mean that this thing called "John Smith" is not a member of 
the class named foaf:Person (which would be CWA since the triple is not 
given). Rather the local scoping says that we cannot state this 
membership externally using that particular blank node.

So these are two quite orthogonal issues: scoping is about which
statements we can make about blank nodes in which "locations", whereas
OWA/CWA is about how missing statements are interpreted semantically.
There is no justification in the standards for interpreting statements
that are missing (or that cannot be given due to blank node scoping) as
false.
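
As a rough illustration of the two issues, here is a minimal Python sketch. Everything in it is an assumption made for the example: triples are plain tuples of strings, and the helper names merge, cwa and owa are made up. The merge renames blank nodes apart, which is why a second document cannot add a triple about the first document's blank node; the two lookup functions only differ in how they read a missing triple.

```python
def merge(*graphs):
    """Merge graphs, renaming blank nodes apart so that labels stay document-local."""
    merged = set()
    for i, g in enumerate(graphs):
        def rename(term, i=i):
            # Blank node labels get a per-graph prefix; other terms are untouched.
            return f"_:g{i}_{term[2:]}" if term.startswith("_:") else term
        merged |= {(rename(s), p, rename(o)) for s, p, o in g}
    return merged

doc = {("_:x", "foaf:name", '"John Smith"')}
external = {("_:x", "rdf:type", "foaf:Person")}  # same label, different scope

merged = merge(doc, external)
named = {s for s, p, o in merged if p == "foaf:name"}
typed = {s for s, p, o in merged if p == "rdf:type"}
# Scoping: the external triple ends up describing a different blank node.
print(named.isdisjoint(typed))

def cwa(graph, triple):
    """Closed world: a triple that is not given is read as false."""
    return triple in graph

def owa(graph, triple):
    """Open world: a triple that is not given is simply unknown (None)."""
    return True if triple in graph else None

query = ("_:x", "rdf:type", "foaf:Person")
print(cwa(doc, query), owa(doc, query))
```

The scoping check and the CWA/OWA reading are independent of each other in the code, which is the point being made above.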

>> Of course, if we are dealing with implicit blank nodes -- the ones 
>> generated by [] or () notation in Turtle -- then it's even more 
>> obvious that the only property connections to/from that blank node are 
>> the ones provided right there.
>>
>> This brings me to an interesting question.  To rephrase, the 
>> "identity" of a blank node object is determined entirely by the 
>> identities of its connected nodes, because it is guaranteed to not 
>> have any other connections.  Therefore, a blank node labeling 
>> algorithm (or standard Skolemization algorithm) only needs to take 
>> into account the subgraph of that blank node's tightly connected 
>> neighbors.  By "tightly connected" I mean the subgraph that is 
>> connected only through consecutive blank nodes.  (I think this may be 
>> slightly different from the Concise Bounded Description (CBD), because 
>> the CBD starts only with the *subject* of a triple.)
>> https://www.w3.org/Submission/CBD/
>>
>> Aiden (or someone else), is this correct?  If so, this would be very 
>> beneficial, because the labeling algorithm could then be guaranteed to 
>> generate the *same* label (or Skolem URI) for the blank nodes in that 
>> subgraph, regardless of any larger graph in which that subgraph 
>> appears.   This is very pertinent to n-ary relations, because it means 
>> that blank nodes for the same n-ary relation, appearing in different 
>> RDF graphs, could be automatically given the *same* label (or Skolem 
>> URI) -- even without knowing a key for that object.  Aiden, is this 
>> what such canonicalization algorithms already do?

They can do that for sure, yes. But it's not the only way.

Thinking about '"identity" of a blank node' in terms of ids that should 
be produced by a Skolemisation algorithm, it is certainly reasonable to 
define such a notion of identity/identifier with respect to data only 
about the connected blank nodes and their triples.

But it might also be reasonable to define identity in terms of the 
entire graph, or some other way.

This is not just a philosophical issue, but has side-effects in practice 
if we consider something like Skolemisation. (And it's not 100% clear 
what notion of "identity" should be applied.)

----------

Take for example a document with:

   [] foaf:name "John Smith" ; :child [ foaf:name "Jane Smith" ] .
   [] foaf:name "John Smith" ; :child [ foaf:name "Jane Smith" ] .

... or, when expanded into N-Triples-style statements (prefixed names 
kept for brevity):

     # connected blank nodes _:a1, _:a2
   _:a1 foaf:name "John Smith" .
   _:a1 :child _:a2 .
   _:a2 foaf:name "Jane Smith" .
     # connected blank nodes _:b1, _:b2
   _:b1 foaf:name "John Smith" .
   _:b1 :child _:b2 .
   _:b2 foaf:name "Jane Smith" .

Now, if we define the Skolem IDs for blank nodes based just on the 
connected blank nodes, we may arrive at, e.g.:

   :S1 foaf:name "John Smith" .
   :S1 :child :S2 .
   :S2 foaf:name "Jane Smith" .

Since the same Skolem IDs are generated for _:a1/_:b1 and _:a2/_:b2 (:S1 
and :S2 respectively), this definition leads to duplicate ground 
triples, which effectively removes one pair of blank nodes.

This does not strike me as at all unreasonable, in that the document:

   [] foaf:name "John Smith" ; :child [ foaf:name "Jane Smith" ] .
   [] foaf:name "John Smith" ; :child [ foaf:name "Jane Smith" ] .

simple-entails and is simple-entailed by a document:

   [] foaf:name "John Smith" ; :child [ foaf:name "Jane Smith" ] .

Semantically speaking, both documents say the same thing (that there 
exists something with name John Smith who has some child with name Jane 
Smith; saying it twice is semantically redundant).
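
The mutual entailment can be checked mechanically for graphs this small. The following is a brute-force Python sketch, not an efficient or standard implementation: it assumes triples are tuples of strings with blank nodes written as "_:label", and the function name simple_entails is made up for the example. It tries every mapping of the blank nodes of one graph to terms of the other and tests whether the mapped graph is a subset.

```python
from itertools import product

def is_bnode(term):
    return term.startswith("_:")

def simple_entails(g, h):
    """True if some mapping of h's blank nodes to terms of g makes h a subset of g."""
    bnodes = sorted({t for s, _, o in h for t in (s, o) if is_bnode(t)})
    terms = sorted({t for s, _, o in g for t in (s, o)})
    for image in product(terms, repeat=len(bnodes)):
        m = dict(zip(bnodes, image))
        if all((m.get(s, s), p, m.get(o, o)) in g for s, p, o in h):
            return True
    return False

once = {
    ("_:a1", "foaf:name", '"John Smith"'), ("_:a1", ":child", "_:a2"),
    ("_:a2", "foaf:name", '"Jane Smith"'),
}
twice = once | {
    ("_:b1", "foaf:name", '"John Smith"'), ("_:b1", ":child", "_:b2"),
    ("_:b2", "foaf:name", '"Jane Smith"'),
}
print(simple_entails(twice, once), simple_entails(once, twice))
```

The search is exponential in the number of blank nodes, so this is only meant for toy inputs like the ones in this mail.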

However, one might argue that there is some pragmatics at play here, and 
that the "duplicate" blank nodes might indicate on some level that the 
publisher views these as different people (maybe the person keys cannot 
be published), and that Skolemisation should not affect multiplicity, in 
which case one might rather want to define the identifiers as:

   :SA1 foaf:name "John Smith" .
   :SA1 :child :SA2 .
   :SA2 foaf:name "Jane Smith" .
   :SB1 foaf:name "John Smith" .
   :SB1 :child :SB2 .
   :SB2 foaf:name "Jane Smith" .

To produce distinct identifiers for the two groups of connected blank 
nodes, this time we need to look at the rest of the graph to distinguish 
any "duplicate" blank node identifiers that arise.
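
To make the difference between the two labelling schemes concrete, here is a rough Python sketch. It is not the blabel algorithm: it uses naive colour refinement by hashing, which can assign the same label to distinct blank nodes inside one component (symmetric ones, for example), a case a real canonical labelling algorithm has to handle. The function names, the keep_duplicates flag and the ":S..." label format are all made up for the example, and triples are assumed to be tuples of strings.

```python
import hashlib

def is_bnode(term):
    return term.startswith("_:")

def digest(*parts):
    return hashlib.sha256("\x00".join(parts).encode()).hexdigest()[:8]

def components(graph):
    """Group blank nodes linked by triples that have a blank node at both ends."""
    adj = {}
    for s, _, o in graph:
        for t in (s, o):
            if is_bnode(t):
                adj.setdefault(t, set())
        if is_bnode(s) and is_bnode(o):
            adj[s].add(o)
            adj[o].add(s)
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in comp:
                comp.add(node)
                stack.extend(adj[node] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def local_colours(graph, comp):
    """Hash each blank node from its own component and adjacent ground terms only."""
    colour = {b: "" for b in comp}
    for _ in range(len(comp)):  # enough rounds to reach across the component
        colour = {
            b: digest(colour[b], *sorted(
                [digest("out", p, colour.get(o, o)) for s, p, o in graph if s == b]
                + [digest("in", p, colour.get(s, s)) for s, p, o in graph if o == b]))
            for b in comp
        }
    return colour

def skolemise(graph, keep_duplicates=False):
    label, copies = {}, {}
    for comp in components(graph):
        colours = local_colours(graph, comp)
        key = tuple(sorted(colours.values()))  # signature of the whole component
        copy = copies.get(key, 0)
        copies[key] = copy + 1
        # The suffix is the only part that depends on the rest of the graph.
        suffix = f"_{copy}" if keep_duplicates else ""
        for b, c in colours.items():
            label[b] = f":S{c}{suffix}"
    return {(label.get(s, s), p, label.get(o, o)) for s, p, o in graph}

doc = {
    ("_:a1", "foaf:name", '"John Smith"'), ("_:a1", ":child", "_:a2"),
    ("_:a2", "foaf:name", '"Jane Smith"'),
    ("_:b1", "foaf:name", '"John Smith"'), ("_:b1", ":child", "_:b2"),
    ("_:b2", "foaf:name", '"Jane Smith"'),
}
print(len(skolemise(doc)))                        # duplicates collapse
print(len(skolemise(doc, keep_duplicates=True)))  # multiplicity preserved
```

With keep_duplicates left off, a component gets the same labels whatever larger graph it sits in; with it switched on, the labels depend on how many identical components the graph contains.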

--------

Finally, consider a slightly more complex example:

   [] foaf:name "John Smith" ; :child [ foaf:name "Jane Smith" ] .
   [] foaf:name "John Smith" ; :child [ foaf:name "Jane Smith" ] .
   [] foaf:name "John Smith" .

The first option, considering only connected blank nodes when generating 
the Skolem IDs, this time leads to:

   :S1 foaf:name "John Smith" .
   :S1 :child :S2 .
   :S2 foaf:name "Jane Smith" .
   :SC1 foaf:name "John Smith" .

The second option that preserves all blank nodes produces:

   :SA1 foaf:name "John Smith" .
   :SA1 :child :SA2 .
   :SA2 foaf:name "Jane Smith" .
   :SB1 foaf:name "John Smith" .
   :SB1 :child :SB2 .
   :SB2 foaf:name "Jane Smith" .
   :SC1 foaf:name "John Smith" .

A third option is to lean the graph before Skolemisation, which would 
produce:

   :S1 foaf:name "John Smith" .
   :S1 :child :S2 .
   :S2 foaf:name "Jane Smith" .

The first option defines blank node identity in terms of the connected 
blank nodes only. The second option defines blank node identity in terms 
of the triples in the local graph. The third option defines blank node 
identity in terms of the simple semantics of the local graph.

The blabel algorithm I wrote can be configured for any of the three 
cases. Which case makes more sense will, I think, depend on the 
application and perhaps on whom you ask. (If the input graphs are 
already lean, the three cases will always coincide.)
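
For the third option, leaning can be sketched by brute force on small graphs. The Python below is an illustrative assumption, not how blabel does it: the function name lean is made up, triples are tuples of strings, and the search is exponential in the number of blank nodes. It repeatedly looks for a mapping of blank nodes to terms of the graph that sends the graph onto a proper subset of itself, and stops when no such mapping exists.

```python
from itertools import product

def is_bnode(term):
    return term.startswith("_:")

def lean(graph):
    """Shrink the graph until no blank node mapping sends it onto a proper subset."""
    graph = set(graph)
    while True:
        bnodes = sorted({t for s, _, o in graph for t in (s, o) if is_bnode(t)})
        terms = sorted({t for s, _, o in graph for t in (s, o)})
        for image in product(terms, repeat=len(bnodes)):
            m = dict(zip(bnodes, image))
            mapped = {(m.get(s, s), p, m.get(o, o)) for s, p, o in graph}
            if mapped < graph:  # proper subset: some triples were redundant
                graph = mapped
                break
        else:
            return graph  # no shrinking mapping exists, so the graph is lean

doc = {
    ("_:a1", "foaf:name", '"John Smith"'), ("_:a1", ":child", "_:a2"),
    ("_:a2", "foaf:name", '"Jane Smith"'),
    ("_:b1", "foaf:name", '"John Smith"'), ("_:b1", ":child", "_:b2"),
    ("_:b2", "foaf:name", '"Jane Smith"'),
    ("_:c1", "foaf:name", '"John Smith"'),
}
core = lean(doc)
print(len(core))
```

Which blank node labels survive depends on the order in which mappings are tried, but the shape of the result (one John Smith with one child Jane Smith) does not.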

> P.S. this would also be very beneficial for the "diff" use case of RDF 
> canonicalization, because it would help localize graph labeling 
> differences.

Diff is a whole 'nother kettle of fish. :)

Cheers,
Aidan
Received on Wednesday, 12 December 2018 20:56:36 UTC