Re: Pragmatics of Blank Nodes Re: Toward easier RDF: a proposal from Andy Seaborne on 2018-12-10 (semantic-web@w3.org from December 2018)

From: Andy Seaborne <andy@seaborne.org>
Date: Mon, 10 Dec 2018 11:29:25 +0000
To: Henry Story <henry.story@bblfish.net>
Cc: Semantic Web <semantic-web@w3.org>
Message-ID: <3463f989-64d2-46a2-9e9c-0c49f0f39a83@seaborne.org>
On 06/12/2018 10:53, Henry Story wrote:
> 
> 
>> On 5 Dec 2018, at 19:28, Andy Seaborne <andy@seaborne.org> wrote:
>>
>>
>>
>> On 05/12/2018 04:13, Patrick J Hayes wrote:
>>>> On Dec 4, 2018, at 9:55 PM, David Booth <david@dbooth.org> wrote:
>>>>
>>>> Hi Pat,
>>>>
>>>> On 12/4/18 7:31 PM, Patrick J Hayes wrote:
>>>>>> On Dec 4, 2018, at 2:30 PM, David Booth <david@dbooth.org> wrote:
>>>>>>
>>>>>> On 12/3/18 8:29 AM, Henry Story wrote:
>>>>>>> . . .  So what are the advantages of blank nodes
>>>>>>> pragmatically? They make a description local to the graph
>>>>>>> in which they appear and this locality is maintained
>>>>>>> across merges. The meaning of URI referenced resources can
>>>>>>> be completed by external information of course but the
>>>>>>> description ensures that no further links need to be taken
>>>>>>> into account when understanding the bnode's meaning. So it
>>>>>>> looks like it's ideal for things that need to be entirely
>>>>>>> defined by description.
>>>>> OR that cannot be *defined* at all, which is closer to the
>>>>> original idea. Henry, why would you assume that everything
>>>>> that can be mentioned, can also be /defined/?
>>>>>>
>>>>>> Interesting point!   That means that blank nodes enjoy a
>>>>>> form of closed world assumption (CWA),
>>>>>
>>>>> No. That is exactly the kind of mistake that one gets into
>>>>> by being too loose with words like 'define'.
>>>>>
>>>>>> in that there *cannot* be any other triples asserted
>>>>>> (directly) about a blank node, other than the ones already
>>>>>> in the document/graph/dataset at hand.  (Inference could
>>>>>> add some though.)
>>>>>
>>>>> Yes, it certainly could, if one has access to something
>>>>> like OWL.
>>>>>>
>>>>>> Of course, if we are dealing with implicit blank nodes --
>>>>>> the ones generated by [] or () notation in Turtle -- then
>>>>>> it's even more obvious that the only property connections
>>>>>> to/from that blank node are the ones provided right there
>>>>>
>>>>> Inference can add extra triples to those also.
>>>>
>>>> Yes, of course.
>>>>
>>>>> Suppose for example you know that the property rdf:rest is functional and you know that x:A rdf:rest _:x ., and someone
>>>>> tells you that
>>>>> x:A rdf:rest _:y .
>>>>> _:y x:Q x:C .
>>>>> then you know that  _:x owl:sameAs _:y ., and hence that _:x x:Q x:C .
>>>>> Now, someone might argue that such cases are vanishingly rare, or even that they shouldn’t be allowed or encouraged, but that would be a different argument.
>>>>>>
>>>>>> This brings me to an interesting question.  To rephrase, the "identity" of a blank node object is determined entirely by the identities of its connected nodes, because it is guaranteed to not have any other connections.
>>>>> It isn't, if we allow inferences.
>>>>
>>>> Certainly we must allow inferences.  However, the results of inference constitute a different graph: the original graph + the entailments.
>>>>
>>>> I put "identity" in quotes above because what I mean is the identity of that node *within* the graph, i.e., a name that allows us to distinguish that node from other nodes in the graph.  I am *not* referring to "all information known/knowable about that node", or "the properties of the node", or any other grand notion of identity like that.  I am talking about identity in the context of blank node labeling, in which the goal is to have a standard algorithm for labeling each blank node.
>>>>
>>>>>> Therefore, a blank node labeling algorithm (or standard
>>>>>> Skolemization algorithm) only needs to take into account the
>>>>>> subgraph of that blank node's tightly connected neighbors.
>>>>>> By "tightly connected" I mean the subgraph that is connected
>>>>>> only through consecutive blank nodes.  (I think this may
>>>>>> be slightly different from the Concise Bounded Description
>>>>>> (CBD), because the CBD starts only with the *subject*
>>>>>> of a triple.)  https://www.w3.org/Submission/CBD/
>>>>>> Aiden (or someone else), is this correct?  If so, this would
>>>>>> be very beneficial, because the labeling algorithm could
>>>>>> then be guaranteed to generate the *same* label (or Skolem
>>>>>> URI) for the blank nodes in that subgraph, regardless of any
>>>>>> larger graph in which that subgraph appears.  This is very
>>>>>> pertinent to n-ary relations, because it means that blank
>>>>>> nodes for the same n-ary relation, appearing in different
>>>>>> RDF graphs, could be automatically given the *same* label (or
>>>>>> Skolem URI) -- even without knowing a key for that object.
>>>>> That would be a wildly invalid conclusion. The coding of an n-ary atomic sentence into binary RDF basically says
>>>>> that an 'event' (or a 'fact', or 'situation', or)  exists
>>>>> which represents the fact of the relation holding between
>>>>> the participants. So my hitting a wall with a hammer (a
>>>>> three-place relation) might be encoded as a bnode of type
>>>>> hitting with an agent being me and an object being the wall
>>>>> and the means being the hammer. But there might be a whole
>>>>> lot of hits of that wall with that hammer by me. You can't
>>>>> infer that the many bnodes which encode various assertions
>>>>> of this kind are all the same single entity with a single
>>>>> global identifier: for one thing, that would imply that I
>>>>> only hit the wall once.
>>>>
>>>> No, it would imply that you hit the wall at *least* once.
>>>> Asserting the same thing multiple times does *not* imply
>>>> that it happened more than once.  It is logically equivalent
>>>> to asserting it once, right?  So if these two statement groups
>>>> appear in a graph:
>>>>
>>>>   [ a :Hit ; :by :hammer ; :agent :pat ; :target :wall ] .
>>>>   [ a :Hit ; :by :hammer ; :agent :pat ; :target :wall ] .
>>>>
>>>> then they are logically equivalent to a single (lean) statement group:
>>>>
>>>>   [ a :Hit ; :by :hammer ; :agent :pat ; :target :wall ] .
>>>>
>>>> and hence they can share the same blank node.  Correct?
>>
>> lean graphs are all very well until update happens.  New information arrives that breaks the equivalence.
>>
>> For an "easier RDF", talking about how the graph is built seems quite natural.
>>
>> Leaning has a place at the point of publishing (maybe).
> 
> Why could not the RDF library implement bnodes as a triple
> 
>     type BNode = GraphId × LocalNodeId × Lean

I don't quite understand the idea here.  Lean changes as the graph is 
updated.

> 
> which could of course be done efficiently with
> 
>     type GraphId=Long
>     type LocalNodeId=Int or Long
>     type Lean=Boolean
> 
> where Lean would be a flag that the node was calculated as lean as
> described as I understand it by the algorithms detailed in
> "Everything you always wanted to know about blank nodes"
> https://www.sciencedirect.com/science/article/pii/S1570826814000481
> 
> ?

That is for a static graph - both for update and for parsing, a system 
wants to deal with the current triple and move on - anything that 
accumulates state is a barrier to scale. BNode internal system ids are 
created as the parser encounters them.

(that said, I don't recall coming across a graph where making it lean 
was a requirement. Leaning alters SPARQL results.)
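
To make the two points about leaning concrete (it changes as the graph 
is updated, and it changes query results), here is a minimal Python 
sketch. It is not a general leaning algorithm and not taken from any 
library: it assumes blank nodes appear only as subjects and that all 
predicates and objects are ground terms, as in the :Hit example quoted 
above. The names lean, hit and count_hits, the :at property and the 
timestamps are made up for illustration.

```python
# Sketch only. A triple is a (subject, predicate, object) tuple of strings,
# and a graph is a set of triples. Assumes blank nodes occur only as subjects
# and never as objects; real leaning needs homomorphism checks.

def is_bnode(term):
    return term.startswith("_:")

def description(graph, node):
    """The (predicate, object) pairs asserted with `node` as subject."""
    return frozenset((p, o) for s, p, o in graph if s == node)

def lean(graph):
    """Drop every blank node whose description is contained in that of
    another surviving subject; under the stated assumptions the dropped
    triples are redundant."""
    subjects = sorted({s for s, _, _ in graph})
    removed = set()
    for b in (s for s in subjects if is_bnode(s)):
        d = description(graph, b)
        for other in subjects:
            if other != b and other not in removed and d <= description(graph, other):
                removed.add(b)
                break
    return {t for t in graph if t[0] not in removed}

def hit(node):
    """The four triples of one :Hit description (illustrative data)."""
    return {
        (node, "rdf:type", ":Hit"),
        (node, ":by", ":hammer"),
        (node, ":agent", ":pat"),
        (node, ":target", ":wall"),
    }

def count_hits(graph):
    """Number of solutions to the pattern  ?h a :Hit ."""
    return sum(1 for _, p, o in graph if p == "rdf:type" and o == ":Hit")

graph = hit("_:b1") | hit("_:b2")
print(count_hits(graph))        # 2 solutions before leaning
print(count_hits(lean(graph)))  # 1 solution after leaning

# An update arrives that distinguishes the two hits (hypothetical :at values).
updated = graph | {
    ("_:b1", ":at", '"2018-12-04"'),
    ("_:b2", ":at", '"2018-12-05"'),
}
print(lean(updated) == updated)  # True: the updated graph is already lean
```

If the graph had been leaned before the update arrived, one of the two 
nodes would already be gone and there would be nothing left for the 
first :at triple to attach to.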

FYI:
Explicit labels: Jena allocates a UUID (i.e. two longs) at the start of a 
parser run and XORs the label, as an integer, with that number to have a 
stateless but label-tracking allocation scheme.  Other schemes are 
available, for example, retain the written label (for debugging), but 
that is the scaling default. This is based on the previous experience 
with tracking labels and then people occasionally running out of heap - 
works in testing, fails at scale.

Implicit labels : allocate a UUID (not a URI) as the internal id.  There 
is no need to have local system ids when a global scheme is practical.
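
A rough sketch of the two allocation schemes, in Python rather than 
Java and not Jena's actual code. The class and method names are 
invented, and the way a label is turned into an integer (a 128-bit 
BLAKE2b hash here) is an assumption, since that detail is not given 
above. The point it shows is that no label table is kept: the only 
state is the per-run seed.

```python
import hashlib
import uuid

class LabelAllocator:
    """One instance per parser run. Holds a fixed 128-bit seed and nothing
    else, so memory use does not grow with the number of labels seen."""

    def __init__(self):
        # Fresh random UUID (two longs' worth of bits) chosen at the start of the run.
        self.seed = uuid.uuid4().int

    def explicit(self, label):
        """Internal id for a written label such as 'x' in _:x.
        Assumption: the label becomes an integer via a 128-bit hash."""
        digest = hashlib.blake2b(label.encode("utf-8"), digest_size=16).digest()
        return uuid.UUID(int=int.from_bytes(digest, "big") ^ self.seed)

    def implicit(self):
        """Internal id for a [] or () blank node: just a fresh UUID."""
        return uuid.uuid4()

run1 = LabelAllocator()
run2 = LabelAllocator()

# The same label within one run always maps to the same id, with no lookup table.
same_in_run = run1.explicit("x") == run1.explicit("x")

# The same label in a different run gets a different id, because the seeds differ.
differs_across_runs = run1.explicit("x") != run2.explicit("x")

print(same_in_run, differs_across_runs)
```

Retaining the written label instead would mean keeping a label-to-id 
map for the whole run, which is the heap growth mentioned above.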

     Andy

> 
>>
>>> Yes, you are absolutely right. And I was wrong, above. (I bow graciously and remove my hat.)  Though if you have both copies, which is what ‘share’ suggests, then your graph is still non-lean. It would be better to just keep one copy, and have a lean graph.
>>>>   And if that blank node is Skolemized, then they can share the same Skolem URI.  Correct?
>>> Yes, with same comment about ‘share’.
>>> Pat
>>>>
>>>> David Booth
>
Received on Monday, 10 December 2018 11:29:51 UTC