Pragmatics of Blank Nodes Re: Toward easier RDF: a proposal from Henry Story on 2018-12-03 (semantic-web@w3.org from December 2018)

From: Henry Story <henry.story@bblfish.net>
Date: Mon, 3 Dec 2018 14:29:55 +0100
To: Hugh Glaser <hugh@glasers.org>
Cc: Anthony Moretti <anthony.moretti@gmail.com>, Thomas Passin <tpassin@tompassin.net>, Semantic Web <semantic-web@w3.org>
Message-Id: <23015FCF-589D-465D-85B0-82F48C4A4EBE@bblfish.net>
Following on from Hugh's comment in this very helpful conversation on blank nodes,
I then develop a little the pragmatic differences between blank nodes, URIs with
fragment identifiers, and UUIDs.

> On 3 Dec 2018, at 12:15, Hugh Glaser <hugh@glasers.org> wrote:
> 
> Hi Henry,
> Can we explore a bit of this some more please:
> 
>> On 3 Dec 2018, at 09:16, Henry Story <henry.story@bblfish.net> wrote:
>> 
>> 
>> 
>>> On 3 Dec 2018, at 00:03, Anthony Moretti <anthony.moretti@gmail.com> wrote:
>>> 
>>> That's interesting Henry, I tried to think more about the example you included in your message. Is what you're saying due to all properties having inverse properties?
>>> 
>>> So the example could also be written as:
>>> 
>>>    var a = {
>>>        type: Address,
>>>        streetAddress: "123 West Jefferson St",
>>>        addressOf: Joe
>>>    }
>>> 
>>>    var b = {
>>>        type: Address,
>>>        streetAddress: "123 West Jefferson St",
>>>        addressOf: Kate
>>>    }
>> 
>> That is one way of thinking of things.
>> 
>> RDF (without OWL) knows very little (nothing?) about which properties are essential to a
>> type. It is a graph or relation. What it knows is that two non identical URIs could refer
>> to different things, and it knows the identity criterion for a set of Literals (the xsd
>> ones).
>> 
>> So the policy is: if something could be different don't assume it is the same: the generic
>> RDF reasoner should not exclude a model of what the graph means.
>> 
>> With blank nodes the reasoner can come to a few more conclusions. Because blank nodes can
>> only appear in the graph in which they appear and not outside of it, all the information
>> about them is available from the graph. And so a blank node can only be distinguished by
>> how it relates to other nodes in the graph.  Furthermore the graph is immutable, so the
>> information that is there is all there is.
> All the examples I have seen seem to be very small, and in isolation, which is not usually possible, because you are unlikely to get to the blank node without it being attached to something.
> The Joe/Kate as properties to the blank node has been mentioned.
> But in what sense in general is the graph immutable?
> [ a "PostalAddress";
>      :streetAddress: "1 High St",
>      :addressLocality geo:london;
> ];
> Seems typical of our discussion.
> But what if I also later get a triple
> geo:london :inRegion geo:ontario .
> 
> If you are only talking about blank nodes that have only literals attached, that wouldn’t be very useful, I would have thought?
> (Especially as you can’t attach Joe or Kate to them.)


My reasoning was designed to work with links from blank nodes to literals or to resources
identified by IRIs. There is a difference, it is true. For literals, identity criteria are
(usually) defined. For resources (objects) identified by URIs, they are not defined a priori.
Still, the RDF reasoning engine, knowing that it does not know the identity criteria of any
URIs, has to assume they may be different, so as not to accidentally discard a model.

(Note that applications using a vocabulary are free to make further inferences about
specific vocabularies they know about. So an application that knows about the earth's size
and shape could make all kinds of inferences about geo coordinates that the simple RDF
engine cannot make, and so add relations between blank nodes or identify blank nodes that
RDF would not identify.  RDF sets only the outer limits of what can be concluded. It sets
the outer limits of the game of meaning that can be played with it.)

Perhaps it is clearer if one makes the graphs more explicit by naming them, and by adding a
blank node to your second graph to make the role of blank nodes clearer. In N3, omitting
namespace declarations:

G1 = {
  _:g1b1 a :PostalAddress;
     :streetAddress "1 High St";
     :addressLocality geo:london.

  _:g1b2 a :PostalAddress;
      :streetAddress "1 High St" .
}

G2 = {

  geo:london :inRegion geo:ontario;
             :inCountry _:g2b1 .
  _:g2b1 :name "Canada" .

  _:g2b2 a :PostalAddress;
     :addressLocality geo:london .
}

I have kept the blank node names separate to make merging easier. Then, using the cwm
built-ins (https://www.w3.org/2000/10/swap/doc/CwmBuiltins), we have:

(G1 G2) log:conjunction {

    _:g1b1 a :PostalAddress;
          :streetAddress "1 High St";
          :addressLocality geo:london.

    _:g1b2 a :PostalAddress;
      :streetAddress "1 High St" .

   geo:london :inRegion geo:ontario;
       :inCountry _:g2b1 .
   _:g2b1 :name "Canada" .

   _:g2b2 a :PostalAddress;
         :addressLocality geo:london .
}

After the conjunction, the nodes _:g1b1, _:g1b2, _:g2b1 and _:g2b2 do not have any new
relations to any literal or URI-named resource beyond those they had before. So the RDF
engine does not need to consider extra information. Indeed, if I am correct, it was already
taking into account that geo:london may have more information relating it to other things,
which is why it does not collapse, say, geo:london and :PostalAddress.
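
As a rough illustration of that claim, here is a minimal Python sketch. It is not cwm
and not a real RDF library: triples are plain tuples, blank nodes are strings starting
with "_:", and the helper names (blank_nodes, mentions) are made up for this example.
The locality is spelled geo:london throughout, on the assumption that both graphs mean
the same node.

```python
# Toy model: a graph is a set of (subject, predicate, object) tuples.
G1 = {
    ("_:g1b1", "rdf:type", ":PostalAddress"),
    ("_:g1b1", ":streetAddress", '"1 High St"'),
    ("_:g1b1", ":addressLocality", "geo:london"),
    ("_:g1b2", "rdf:type", ":PostalAddress"),
    ("_:g1b2", ":streetAddress", '"1 High St"'),
}
G2 = {
    ("geo:london", ":inRegion", "geo:ontario"),
    ("geo:london", ":inCountry", "_:g2b1"),
    ("_:g2b1", ":name", '"Canada"'),
    ("_:g2b2", "rdf:type", ":PostalAddress"),
    ("_:g2b2", ":addressLocality", "geo:london"),
}


def is_blank(term):
    """Blank nodes are the terms written with a leading "_:"."""
    return term.startswith("_:")


def blank_nodes(graph):
    """All blank nodes used as subject or object in the graph."""
    return {x for s, _, o in graph for x in (s, o) if is_blank(x)}


def mentions(graph, term):
    """All triples in which the term occurs as subject or object."""
    return {t for t in graph if term in (t[0], t[2])}


# The blank node labels are already disjoint, so the merge is a plain union.
merged = G1 | G2

# Every blank node has exactly the same triples after the merge as before.
unchanged = all(
    mentions(merged, b) == mentions(g, b)
    for g in (G1, G2)
    for b in blank_nodes(g)
)
print(unchanged)

# The URI-named node, by contrast, picks up new relations from the other graph.
print(len(mentions(G1, "geo:london")), len(mentions(merged, "geo:london")))
```

The first print shows that the blank nodes are untouched by the merge; the second shows
geo:london going from one triple in G1 to four in the merged graph.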

cwm could provide a log:lean relation relating a graph to its lean version, in which
case we would have

G1 log:lean {
  _:g1b1 a :PostalAddress;
     :streetAddress "1 High St";
     :addressLocality geo:london.
}

and (G1 G2) log:conjunction [ log:lean {
    _:g1b1 a :PostalAddress;
          :streetAddress "1 High St";
          :addressLocality geo:london.

   geo:london :inRegion geo:ontario;
       :inCountry _:g2b1 .
   _:g2b1 :name "Canada" .
} ] .

I read reasonably carefully 3/4 of "Everything you always wanted to Know About Blank
Nodes" https://www.sciencedirect.com/science/article/pii/S1570826814000481 and, as far as I
can tell from it, on small graphs (the size one is likely to need for many
everyday apps) there are efficient approximations to lean graphs.
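
To make "lean" concrete on graphs of this size, here is a hedged Python sketch of an
exact brute-force leaning procedure. It is not one of the efficient approximations from
the paper and not anything cwm provides: it simply tries every mapping of blank nodes to
terms and keeps shrinking the graph while some mapping sends it into a proper subgraph of
itself. That is exponential in the number of blank nodes, so it is only sensible for tiny
graphs. The tuple representation and the function names are assumptions of the sketch.

```python
from itertools import product

G1 = {
    ("_:g1b1", "rdf:type", ":PostalAddress"),
    ("_:g1b1", ":streetAddress", '"1 High St"'),
    ("_:g1b1", ":addressLocality", "geo:london"),
    ("_:g1b2", "rdf:type", ":PostalAddress"),
    ("_:g1b2", ":streetAddress", '"1 High St"'),
}
G2 = {
    ("geo:london", ":inRegion", "geo:ontario"),
    ("geo:london", ":inCountry", "_:g2b1"),
    ("_:g2b1", ":name", '"Canada"'),
    ("_:g2b2", "rdf:type", ":PostalAddress"),
    ("_:g2b2", ":addressLocality", "geo:london"),
}


def is_blank(term):
    return term.startswith("_:")


def proper_image(graph):
    """Return a proper subgraph that is an instance of graph, or None if graph is lean."""
    nodes = sorted({x for s, _, o in graph for x in (s, o)})
    blanks = [n for n in nodes if is_blank(n)]
    # Try every assignment of the blank nodes to terms occurring in the graph.
    for targets in product(nodes, repeat=len(blanks)):
        mu = dict(zip(blanks, targets))
        image = {(mu.get(s, s), p, mu.get(o, o)) for s, p, o in graph}
        if image < graph:  # strictly smaller and contained in the graph
            return image
    return None


def lean(graph):
    """Shrink the graph until no blank node mapping makes it any smaller."""
    graph = set(graph)
    while True:
        smaller = proper_image(graph)
        if smaller is None:
            return graph
        graph = smaller


lean_g1 = lean(G1)
lean_merged = lean(G1 | G2)
print(sorted(lean_g1))
print(sorted(lean_merged))
```

On these two graphs the redundant address nodes _:g1b2 and _:g2b2 fold into _:g1b1,
while _:g2b1 stays because nothing else is both the country of geo:london and named
"Canada".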

So what are the advantages of blank nodes pragmatically? They make a description local to
the graph in which they appear, and this locality is maintained across merges. The meaning
of URI-referenced resources can of course be completed by external information, but the
description ensures that no further links need to be taken into account when understanding
the bnode's meaning. So blank nodes look ideal for things that need to be entirely
defined by description.

Close to blank nodes we have URIs with fragment identifiers (UFIs), which are defined in
RFC 3986 §3.5 https://tools.ietf.org/html/rfc3986#section-3.5 as getting their meaning from
the representation in which they appear. Their use is that one knows where the
canonical definition of the node is to be found, and that they can be linked to from other
resources. So these are useful for defining new terms. As opposed to blank nodes, if one
merges two graphs here, one can end up with new relations on the UFI, so that to find the
canonical definition one needs to go back to the original graph. Pragmatically this also
means that the owner of the resource, in so far as he is coining a new term, is responsible
for the continuity of that term's definition across state changes of the resource, since
by changing its meaning he can change the meaning of graphs using it.
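
A small Python sketch of that point, using only the standard library's
urllib.parse.urldefrag. The example.org and example.net URLs, the triples, and the helper
names are all hypothetical; the "documents" dictionary stands in for fetching each
resource over the web.

```python
from urllib.parse import urldefrag

# Hypothetical term coined in a hypothetical vocabulary document.
TERM = "https://example.org/vocab#PostalAddress"

# Stand-in for the web: document URL -> graph (set of triples) found there.
documents = {
    "https://example.org/vocab": {
        (TERM, "rdf:type", "rdfs:Class"),
        (TERM, "rdfs:label", '"Postal address"'),
    },
    "https://example.net/notes": {
        (TERM, "rdfs:comment", '"Only used for UK addresses"'),
    },
}


def defining_document(iri):
    """The URL without its fragment: where the canonical definition lives."""
    return urldefrag(iri).url


def canonical_definition(iri, docs):
    """Triples about the term taken only from its defining document."""
    graph = docs.get(defining_document(iri), set())
    return {t for t in graph if iri in (t[0], t[2])}


# Merging everything adds relations on the term that its owner never stated.
merged = set().union(*documents.values())
about_term_in_merge = {t for t in merged if TERM in (t[0], t[2])}

print(defining_document(TERM))
print(len(canonical_definition(TERM, documents)), len(about_term_in_merge))
```

The merged graph says more about the term than its owner did, and the fragment is what
tells you which document to go back to for the canonical part.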

By comparison one can then see how UUIDs work. As UUIDs are not linked to a canonical
description, as UFIs are, and are not guaranteed logically to maintain the same relations to
literals or URIs in a graph pattern, as BNodes are, their meaning is up in the air. At
best, someone with a full view of all resources on the web could tell where a UUID had been
coined first. Such an architecture would not be feasible globally, nor is it one that is
desirable. There is no reason why first publication should be decisive for these.
Perhaps UUIDs are useful, then, for emergent concepts where someone would like to work
with others on trying to understand something. One could only limit the meaning of a UUID
by declaring one's view to be incompatible with that of others.





> 
> Best
> Hugh
>> 
>> If your graph says there is a white cat and there is a cat, then that will be true in all models where there is a white cat. So adding there is a cat, adds nothing. If you meant to say there are two cats, then you would need to add a relation between the blank nodes.

Hans wrote in reply to the above in a separate message which I am collecting back here:

> "You can solve this 'cat' problem by giving each cat a unique identifier and type each one as a 'cat'. We use UUIDs as identifiers."

Yes, that would stop an RDF engine merging those nodes. But unless there was other identifying information about each cat in the graph relating to the UUID, it would be difficult for you yourself to know which cat was referenced by which UUID. If someone else then coined UUIDs for your cats too, then you would not be able to find out if you are double counting your cats either. And if someone maliciously published information using your UUIDs about other things, it would be quite difficult to tell whose definition was correct.
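
The double-counting worry can be sketched in a few lines of Python with the standard
uuid module. The describe_cat helper and the :Cat and :colour terms are invented for the
illustration, and triples are again plain tuples rather than a real RDF store.

```python
import uuid


def describe_cat(colour):
    """Mint a fresh UUID URN for a cat and describe it in two triples."""
    cat = f"urn:uuid:{uuid.uuid4()}"
    triples = {
        (cat, "rdf:type", ":Cat"),
        (cat, ":colour", f'"{colour}"'),
    }
    return cat, triples


# You describe your white cat; someone else describes what may be the same animal.
my_cat, my_graph = describe_cat("white")
their_cat, their_graph = describe_cat("white")

merged = my_graph | their_graph
cats = {s for s, p, o in merged if p == "rdf:type" and o == ":Cat"}

# A generic engine sees two distinct names and so counts two cats, and nothing
# in the graph says whether they denote one animal or two.
print(my_cat != their_cat, len(cats))
```

Only extra identifying information, or a key of the kind OWL lets one declare, would
allow anyone to decide whether the two identifiers refer to the same cat.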

>> 
>> 
>>> 
>>> An implementation of ==() for Address would know which fields were relevant to the equality of Addresses and which were not, so it would ignore the addressOf field. It also shows that a check for structural equality is often too simple, ==() is often type dependent.
>> 
>> That's what OWL inverse functional properties and keys allow you to do: specify which "attributes" determine the
>> identity of an object of a certain type. If the statement that the pair us:streetAddress and us:postcode formed a key
>> were written out in OWL and made easily available the OWL reasoner would then be able to deduce that Joe and 
>> Kate lived at the same address.
>> 
>>> 
>>> Because these are blank nodes we have access to all the properties that point to them and it would be possible to check for inverses prior to checking for equality I think.
>>> 
>>> I do see things through the lens of OO though, so it's entirely possible I'm missing a point being made, if so please let me know.
>> 
>> OWL is kind of a declarative Object Oriented Language in the sense that OO comes with a notion of classes and inheritance if that helps :-)
>> 
>>> 
>>> Anthony
>>> 
>>> On Sun, Dec 2, 2018 at 3:00 PM Thomas Passin <tpassin@tompassin.net> wrote:
>>> On 12/2/2018 4:31 PM, Henry Story wrote:
>>>> One could make a similar argument in Java, JavaScript or Scala. In
>>>> such languages the identity of objects can only be determined by
>>>> fields of the objects, that is arrows going *from* the object to a
>>>> literal or another object, as there is no way to determine what 
>>>> objects are pointing to an object O from within O: there is no global
>>>> object index.  So someone coming from that background would find it
>>>> odd that arrows pointing to a blank node could make a difference as
>>>> to the equality of that object.
>>> 
>>> I don't think that this is quite right.  The usual objects in OO
>>> languages know their parent *type* or (prototype).  They don't know
>>> about any linkages to other objects unless properties are assigned that
>>> hold that information.  And it's common to provide for such linkages.
>>> Just think of doubly linked lists, or the XML DOM.  True, there is 
>>> usually no built-in object index, but a programmer can provide for one 
>>> if it's wanted.
>>> 
>>> Anyway, garbage collectors need to find all the objects so they can do 
>>> their job, so in garbage-collected languages the information must be 
>>> available in some manner.
Received on Monday, 3 December 2018 13:30:22 UTC