Re: Pragmatics of Blank Nodes Re: Toward easier RDF: a proposal from David Booth on 2018-12-12 (semantic-web@w3.org from December 2018)

From: David Booth <david@dbooth.org>
Date: Wed, 12 Dec 2018 17:57:28 -0500
To: Aidan Hogan <aidhog@gmail.com>, semantic-web <semantic-web@w3.org>
Message-ID: <3c6e2d21-3506-6ece-3c53-0b0296906dfc@dbooth.org>
On 12/12/18 3:56 PM, Aidan Hogan wrote:
> On 04-12-2018 17:52, David Booth wrote:
>> On 12/4/18 3:30 PM, David Booth wrote:
>>> On 12/3/18 8:29 AM, Henry Story wrote:
>>>  > . . .  So what are the advantages of blank nodes
>>>  > pragmatically? They make a description local to the graph
>>>  > in which they appear and this locality is maintained
>>>  > across merges. The meaning of URI referenced resources can
>>>  > be completed by external information of course but the
>>>  > description ensures that no further links need to be taken
>>>  > into account when understanding the bnode's meaning. So it
>>>  > looks like it's ideal for things that need to be entirely
>>>  > defined by description.
>>>
>>> Interesting point!   That means that blank nodes enjoy a
>>> form of closed world assumption (CWA), in that there *cannot*
>>> be any other triples asserted (directly) about a blank node,
>>> other than the ones already in the document/graph/dataset
>>> at hand.  (Inference could add some though.)
> 
> In terms of where you mention "form of CWA", I get where you're coming 
> from but I think it's an orthogonal issue to CWA.
> 
> CWA says that the data that are not given (triples in our case) are 
> assumed to not be true. The local scoping of blank nodes means we cannot 
> make more statements about that particular blank node.

Agreed.  I guess it was misleading to use that term.

>>> . . .
>>> This brings me to an interesting question.  To rephrase, the 
>>> "identity" of a blank node object is determined entirely by the 
>>> identities of its connected nodes, because it is guaranteed to not 
>>> have any other connections.  Therefore, a blank node labeling 
>>> algorithm (or standard Skolemization algorithm) only needs to take 
>>> into account the subgraph of that blank node's tightly connected 
>>> neighbors.  By "tightly connected" I mean the subgraph that is 
>>> connected only through consecutive blank nodes.  (I think this may be 
>>> slightly different from the Concise Bounded Description (CBD), 
>>> because the CBD starts only with the *subject* of a triple.)
>>> https://www.w3.org/Submission/CBD/
>>>
>>> Aidan (or someone else), is this correct?  If so, this would be very 
>>> beneficial, because the labeling algorithm could then be guaranteed 
>>> to generate the *same* label (or Skolem URI) for the blank nodes in 
>>> that subgraph, regardless of any larger graph in which that subgraph 
>>> appears.   This is very pertinent to n-ary relations, because it 
>>> means that blank nodes for the same n-ary relation, appearing in 
>>> different RDF graphs, could be automatically given the *same* label 
>>> (or Skolem URI) -- even without knowing a key for that object.  
>>> Aidan, is this what such canonicalization algorithms already do?
> 
> They can do that for sure, yes. But it's not the only way.
> 
> Thinking about '"identity" of a blank node' in terms of ids that should 
> be produced by a Skolemisation algorithm, it is certainly reasonable to 
> define such a notion of identity/identifier with respect to data only 
> about the connected blank nodes and their triples.

Excellent!   This is *very* helpful for the "diff" use case, because it 
helps to localize changes to canonical serializations.  This does not 
matter to a digital signature use case, but it matters a lot to the diff 
use case.
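
To make "tightly connected" concrete, here is a minimal sketch of how 
the blank nodes of a graph could be grouped, using Aidan's example from 
further down in this message.  It assumes triples are plain tuples of 
strings and that blank nodes are the strings starting with "_:"; the 
names is_bnode and bnode_components are made up for illustration and are 
not taken from blabel or any other library.

```python
from collections import defaultdict

def is_bnode(term):
    # Assumption of this sketch: blank nodes are strings starting with "_:".
    return term.startswith("_:")

def bnode_components(triples):
    """Group blank nodes that are linked through blank-to-blank triples and
    return, for each group, the set of triples mentioning any of its nodes."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Only a triple with a blank subject AND a blank object joins two groups.
    for s, _, o in triples:
        if is_bnode(s) and is_bnode(o):
            parent[find(s)] = find(o)

    groups = defaultdict(set)
    for s, p, o in triples:
        for term in (s, o):
            if is_bnode(term):
                groups[find(term)].add((s, p, o))
    return [frozenset(g) for g in groups.values()]

graph = [
    ("_:a1", "foaf:name", '"John Smith"'),
    ("_:a1", ":child", "_:a2"),
    ("_:a2", "foaf:name", '"Jane Smith"'),
    ("_:b1", "foaf:name", '"John Smith"'),
    ("_:b1", ":child", "_:b2"),
    ("_:b2", "foaf:name", '"Jane Smith"'),
]
components = bnode_components(graph)
```

Here components holds two groups of three triples each, and adding 
ground triples elsewhere in the graph does not change them, which is 
the property that matters for localizing diffs.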

> 
> But it might also be reasonable to define identity in terms of the 
> entire graph, or some other way.
> 
> This is not just a philosophical issue, but has side-effects in practice 
> if we consider something like Skolemisation. (And it's not 100% clear 
> what notion of "identity" should be applied.)
> 
> ----------
> 
> Take for example a document with:
> 
>    [] foaf:name "John Smith" ; :child [ foaf:name "Jane Smith" ] .
>    [] foaf:name "John Smith" ; :child [ foaf:name "Jane Smith" ] .
> 
> ... or when expanded in NTriples:
> 
>      # connected blank nodes _:a1, _:a2
>    _:a1 foaf:name "John Smith" .
>    _:a1 :child _:a2 .
>    _:a2 foaf:name "Jane Smith" .
>      # connected blank nodes _:b1, _:b2
>    _:b1 foaf:name "John Smith" .
>    _:b1 :child _:b2 .
>    _:b2 foaf:name "Jane Smith" .
> 
> Now, if we define the skolem ids for blank nodes based just on the 
> connected blank nodes, we may arrive at, e.g.:
> 
>    :S1 foaf:name "John Smith" .
>    :S1 :child :S2 .
>    :S2 foaf:name "Jane Smith" .
> 
> Since the same skolem ids are generated for _:a1/_:b1 and _:a2/_:b2 (:S1 
> and :S2 resp.), this definition leads to duplicate ground triples, 
> meaning effectively we remove one pair of blank nodes.
> 
> This does not strike me as at all unreasonable, in that the document:
> 
>    [] foaf:name "John Smith" ; :child [ foaf:name "Jane Smith" ] .
>    [] foaf:name "John Smith" ; :child [ foaf:name "Jane Smith" ] .
> 
> simple-entails and is simple-entailed by a document:
> 
>    [] foaf:name "John Smith" ; :child [ foaf:name "Jane Smith" ] .
> 
> Semantically speaking, both documents say the same thing (that there 
> exists something with name John Smith who has some child with name Jane 
> Smith; saying it twice is semantically redundant).

Agreed.
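
For anyone who wants to check the mutual entailment mechanically, here 
is a brute-force sketch.  It relies on the interpolation lemma from the 
RDF Semantics (G simply entails E exactly when some instance of E is a 
subgraph of G) and simply tries every mapping of E's blank nodes to 
terms of G.  The tuple-of-strings representation and the function name 
simple_entails are assumptions of the sketch, and the search is 
exponential, so it is only suitable for toy graphs like these.

```python
from itertools import product

def is_bnode(term):
    # Assumption of this sketch: blank nodes are strings starting with "_:".
    return term.startswith("_:")

def simple_entails(g, e):
    """True if some instance of e (blank nodes of e replaced by terms of g)
    is a subgraph of g.  Brute force over all candidate mappings."""
    g, e = set(g), set(e)
    bnodes = sorted({t for s, _, o in e for t in (s, o) if is_bnode(t)})
    terms = sorted({t for s, _, o in g for t in (s, o)})
    for image in product(terms, repeat=len(bnodes)):
        mapping = dict(zip(bnodes, image))
        if all((mapping.get(s, s), p, mapping.get(o, o)) in g for s, p, o in e):
            return True
    return False

twice = [
    ("_:a1", "foaf:name", '"John Smith"'),
    ("_:a1", ":child", "_:a2"),
    ("_:a2", "foaf:name", '"Jane Smith"'),
    ("_:b1", "foaf:name", '"John Smith"'),
    ("_:b1", ":child", "_:b2"),
    ("_:b2", "foaf:name", '"Jane Smith"'),
]
once = [
    ("_:c1", "foaf:name", '"John Smith"'),
    ("_:c1", ":child", "_:c2"),
    ("_:c2", "foaf:name", '"Jane Smith"'),
]

both_ways = simple_entails(twice, once) and simple_entails(once, twice)
```

With these inputs both_ways is True: mapping _:a1 and _:b1 to _:c1, and 
_:a2 and _:b2 to _:c2, turns the longer document into the shorter one.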

> 
> However, one might argue that there is some pragmatics at play here, and 
> that the "duplicate" blank nodes might indicate on some level that the 
> publisher views these as different people (maybe the person keys cannot 
> be published), and that Skolemisation should not affect multiplicity, in 
> which case one might rather want to define the identifiers as:
> 
>    :SA1 foaf:name "John Smith" .
>    :SA1 :child :SA2 .
>    :SA2 foaf:name "Jane Smith" .
>    :SB1 foaf:name "John Smith" .
>    :SB1 :child :SB2 .
>    :SB2 foaf:name "Jane Smith" .
> 
> To produce distinct identifiers for the two groups of connected blank 
> nodes, this time we need to look at the rest of the graph to distinguish 
> any "duplicate" blank node identifiers that arise.

I agree, but since that interpretation of the graph is squarely at odds 
with the RDF semantics, my own opinion is that users should NOT attempt 
to use blank nodes that way: doing so is simply WRONG.  We should not 
endorse it just because some people may try to do it.  If users want to 
achieve that effect then they should use URIs instead of blank nodes.
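
To illustrate the first interpretation (Skolem ids computed from the 
connected blank nodes only), here is a rough sketch.  It is NOT blabel 
and not a complete canonical labeling: it uses simple hash-based colour 
refinement within each group of connected blank nodes, which can give 
the same id to blank nodes that a full algorithm would keep apart.  The 
function names and the https://example.org/.well-known/genid/ prefix are 
hypothetical, and triples are assumed to be tuples of strings with blank 
nodes starting with "_:".

```python
import hashlib

def is_bnode(term):
    # Assumption of this sketch: blank nodes are strings starting with "_:".
    return term.startswith("_:")

def digest(*parts):
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()[:12]

def component_of(start, triples):
    """Blank nodes reachable from start through blank-to-blank triples."""
    seen, todo = {start}, [start]
    while todo:
        node = todo.pop()
        for s, _, o in triples:
            if is_bnode(s) and is_bnode(o) and node in (s, o):
                for other in (s, o):
                    if other not in seen:
                        seen.add(other)
                        todo.append(other)
    return seen

def signature(node, triples, colour):
    """Sorted description of the triples touching node, with neighbouring
    blank nodes replaced by their current colour."""
    sig = set()
    for s, p, o in triples:
        if s == node:
            sig.add("out " + p + " " + (colour[o] if is_bnode(o) else o))
        if o == node:
            sig.add("in " + p + " " + (colour[s] if is_bnode(s) else s))
    return sorted(sig)

def local_skolem_ids(triples, prefix="https://example.org/.well-known/genid/"):
    triples = set(triples)
    bnodes = {t for s, _, o in triples for t in (s, o) if is_bnode(t)}
    ids = {}
    for b in sorted(bnodes):
        if b in ids:
            continue
        comp = component_of(b, triples)
        colour = dict.fromkeys(comp, "")
        # The number of rounds depends only on the component, so the ids
        # do not change when unrelated triples are added to the graph.
        for _ in range(len(comp)):
            colour = {n: digest(colour[n], *signature(n, triples, colour))
                      for n in comp}
        for n in comp:
            ids[n] = prefix + colour[n]
    return ids

def skolemize(triples):
    ids = local_skolem_ids(triples)
    return {(ids.get(s, s), p, ids.get(o, o)) for s, p, o in triples}

graph = [
    ("_:a1", "foaf:name", '"John Smith"'),
    ("_:a1", ":child", "_:a2"),
    ("_:a2", "foaf:name", '"Jane Smith"'),
    ("_:b1", "foaf:name", '"John Smith"'),
    ("_:b1", ":child", "_:b2"),
    ("_:b2", "foaf:name", '"Jane Smith"'),
]
skolemized = skolemize(graph)
```

Because _:a1/_:b1 and _:a2/_:b2 receive identical ids, skolemized 
contains three triples rather than six, which is the collapse Aidan 
describes for this definition of identity.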

> 
> --------
> 
> Finally consider a slightly more complex example
> 
>    [] foaf:name "John Smith" ; :child [ foaf:name "Jane Smith" ] .
>    [] foaf:name "John Smith" ; :child [ foaf:name "Jane Smith" ] .
>    [] foaf:name "John Smith" .
> 
> The first option, considering only connected blank nodes when generating 
> the Skolem IDs, this time leads to:
> 
>    :S1 foaf:name "John Smith" .
>    :S1 :child :S2 .
>    :S2 foaf:name "Jane Smith" .
>    :SC1 foaf:name "John Smith" .
> 
> The second option that preserves all blank nodes produces:
> 
>    :SA1 foaf:name "John Smith" .
>    :SA1 :child :SA2 .
>    :SA2 foaf:name "Jane Smith" .
>    :SB1 foaf:name "John Smith" .
>    :SB1 :child :SB2 .
>    :SB2 foaf:name "Jane Smith" .
>    :SC1 foaf:name "John Smith" .
> 
> A third option is to lean the graph before Skolemisation, which would 
> produce:
> 
>    :S1 foaf:name "John Smith" .
>    :S1 :child :S2 .
>    :S2 foaf:name "Jane Smith" .
> 
> The first option defines blank node identity in terms of the connected 
> blank nodes only. The second option defines blank node identity in terms 
> of the triples in the local graph. The third option defines blank node 
> identity in terms of the simple semantics of the local graph.
> 
> The blabel algorithm I wrote can be configured for any of the three 
> cases. Which case makes more sense I think will depend on the 
> application and maybe whom you ask. (If the input graphs are already 
> lean, the three cases will always coincide.)

Excellent explanation of the options!  My own opinions:

  - Option 1 is best.  If the user wishes to "lean" the graph or apply 
RDFS, OWL or other inferencing prior to canonicalization, then he/she is 
free to do so, but it should not be required.

  - We should not support option 2, because that would be teaching 
people incorrect usage of RDF.

Thanks for your help!
David Booth

> 
>> P.S. this would also be very beneficial for the "diff" use case of RDF 
>> canonicalization, because it would help localize graph labeling 
>> differences.
> 
> Diff is a whole 'nother kettle of fish. :)
> 
> Cheers,
> Aidan
Received on Wednesday, 12 December 2018 22:57:51 UTC