Re: Blank Nodes Re: Toward easier RDF: a proposal from Henry Story on 2018-11-28 (semantic-web@w3.org from November 2018)

From: Henry Story <henry.story@bblfish.net>
Date: Wed, 28 Nov 2018 16:41:12 +0100
To: David Booth <david@dbooth.org>
Cc: semantic-web@w3.org
Message-Id: <DCEEC71C-094A-482D-96E4-88A128504619@bblfish.net>
> On 28 Nov 2018, at 05:16, David Booth <david@dbooth.org> wrote:
> 
> On 11/27/18 10:47 PM, Thomas Passin wrote:
>> On 11/27/2018 10:01 PM, David Booth wrote:
>>> On 11/27/18 2:04 PM, Nathan Rixham wrote:
>>> . . .
>>>> Here's an extract:
>>>> {
>>>>     ...
>>>>    "name": "County Assessor's Office",
>>>>    "address": {
>>>>      "@type": "PostalAddress",
>>>>      "streetAddress": "123 West Jefferson Street",
>>>>      "addressLocality": "Phoenix",
>>>>      "addressRegion": "AZ",
>>>>      "postalCode": "85003",
>>>>      "addressCountry": "US"
>>>>    },
>>>>    "geo": {
>>>>      "@type": "GeoCoordinates",
>>>>      "latitude": 33.4466,
>>>>      "longitude": -112.07837  },
>>>> }
>>>> . . .
>>>> [To] have the same address or geo coordinates published on tens of thousands of different websites, all using a different ID (uri) would be a huge, horrible, mess.
>>> 
>>> Not so fast.  Two points:
>>> 
>>>   - Unless you make a unique name assumption with URIs, that huge, horrible mess is pretty much the situation we already have using blank nodes.  Except that in some ways the current situation is *worse*, because the same data loaded twice cause duplicate triples (non-lean), whereas that would be automatically avoided if URIs were used.
>> But the key point here is that they might or might not be duplicates. And the types and predicates (the semantics of table and column names, if you get right down to it, since a lot of linked data comes from relational databases) might or might not be the same.  There has to be some way to get decent assurances that they *are* the same, before the graphs get merged.  Tinkering with the RDF specs, and having ways to canonically name blank nodes, won't handle this problem. It's a data and semantics problem instead.

That is mostly because there are no globally clear ways of identifying two identical addresses. That could indeed
be solved by having a URN scheme globally understood for addresses that would be easy to coin - a kind of more
powerful postcode. 

A person looking at the JSON sees the same address because they are assuming one of two things:
  1. that "address" is a functional property.
    That is what OO programmers tend to assume, since each of their attributes holds a single value. If they then want more than one address they need to move to a relation to a set. Then they have the same problem, since they would need to work out when two addresses are the same.
  2. that two addresses are the same if they have exactly the same attributes and values.


I wonder whether case 2 gives us a recipe for when one can merge such blank nodes. I don't think that
the following two graphs G1 and G2 are semantically different:

G1 =  {
:joe :address [ a "PostalAddress";
     :streetAddress "123 West Jefferson Street";
     :addressLocality "Phoenix";
     :addressRegion "AZ";
     :postalCode "85003";
     :addressCountry "US";
   ];
   :address [ a "PostalAddress";
     :streetAddress "123 West Jefferson Street";
     :addressLocality "Phoenix";
     :addressRegion "AZ";
     :postalCode "85003";
     :addressCountry "US";
   ].
}


G2 = {
:joe :address [ a "PostalAddress";
     :streetAddress "123 West Jefferson Street";
     :addressLocality "Phoenix";
     :addressRegion  "AZ";
     :postalCode "85003";
     :addressCountry "US";
   ].
}

They would be satisfied in exactly the same models, it seems to me. No?

Take a model Model1 where Joe has one flat at that address. Then

Model1 ⊨ G1 and Model1 ⊨ G2 

since 1 address will satisfy both blank nodes.

At the same time, if in Model2 Joe has two flats at that same address, we also have

Model2 ⊨ G1 and Model2 ⊨ G2 

since any of the flats will satisfy the address of each blank node.
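
To make the recipe concrete, here is a rough Python sketch of one sufficient condition for merging: two blank nodes are collapsed when every triple they appear in is identical apart from the blank node itself. The triple representation (plain tuples, blank nodes as strings starting with "_:") and the names merge_identical_bnodes and address are my own assumptions for illustration. This is not a general procedure for making a graph lean, only the simple case discussed above.

```python
from collections import defaultdict


def is_bnode(term):
    # Assumption: blank nodes are written as strings starting with "_:".
    return isinstance(term, str) and term.startswith("_:")


def merge_identical_bnodes(graph):
    """Collapse blank nodes whose incoming and outgoing triples are identical."""
    graph = set(graph)
    changed = True
    while changed:
        changed = False
        # Signature of a blank node: every triple it occurs in, minus itself.
        signature = defaultdict(set)
        for s, p, o in graph:
            if is_bnode(s):
                signature[s].add(("out", p, o))
            if is_bnode(o):
                signature[o].add(("in", s, p))
        representative = {}
        rename = {}
        for node in sorted(signature):
            key = frozenset(signature[node])
            if key in representative:
                rename[node] = representative[key]
            else:
                representative[key] = node
        if rename:
            # Renaming can make parent blank nodes identical, so loop again.
            graph = {(rename.get(s, s), p, rename.get(o, o)) for s, p, o in graph}
            changed = True
    return graph


def address(node):
    """The six address triples from the example, attached to the given node."""
    return {
        (node, "rdf:type", '"PostalAddress"'),
        (node, ":streetAddress", '"123 West Jefferson Street"'),
        (node, ":addressLocality", '"Phoenix"'),
        (node, ":addressRegion", '"AZ"'),
        (node, ":postalCode", '"85003"'),
        (node, ":addressCountry", '"US"'),
    }


G1 = ({(":joe", ":address", "_:a1"), (":joe", ":address", "_:a2")}
      | address("_:a1") | address("_:a2"))
G2 = {(":joe", ":address", "_:a1")} | address("_:a1")

print(merge_identical_bnodes(G1) == G2)
```

With these inputs the 14 triples of G1 collapse to the 7 triples of G2, so the comparison should print True. The merge is safe in this case because mapping one blank node onto the other only produces triples that are already in the graph.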


> 
> Perhaps we are talking about different things.  To my mind, if the above example appears in two different datasets, with the same @context so that the same triples are generated (except for blank node labels), then they *are* duplicates and they *do* mean the same thing.  And if the same predictable URI is generated instead of a blank node each time the JSON-LD uses curly braces {}, then to my mind that would be a *good* thing, because those bits of RDF, even though they come from different sources, *are* the same address and *do* mean the same thing.
> 
> At least, that's how I look at it.  Please explain further if I've misunderstood your point.
> 
> > On top of that, many of these data graphs that one wants
> > to merge won't be either isomorphic to each other, or be
> > subsets or supersets.  In that situation, I don't see how
> > a blank node identifying algorithm that has to traverse
> > and consider the whole graph can spit out identifiers that
> > will make corresponding blank nodes in the various graphs
> > reliably have the same identifier.  That's the kind of
> > algorithm that Aidan Hogan's papers talk about, isn't it,
> > ones that consider the entire graph?
> 
> It has to consider more of the graph *if* blank node cycles are permitted, and that is what Aidan's algorithm does.  But if blank node cycles are not permitted, such as by prohibiting explicit blank nodes (but permitting implicit blank nodes generated by [] notation in Turtle), then the whole graph does *not* need to be considered.  Nodes can be efficiently and consistently labeled bottom-up if the graph is a tree with respect to blank node connections -- i.e., it has no blank node cycles.  The graph could still have cycles that involve URIs though -- those do not cause a problem.
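
As I read it, the bottom-up labelling David describes could look roughly like the Python sketch below. It is only an illustration under my own assumptions: triples are plain tuples, blank nodes are strings starting with "_:", a label is derived from a hash of the node's outgoing predicate/object pairs only, and the "urn:example:bnode:" prefix is a made-up placeholder. One consequence of hashing only outgoing triples is that blank nodes with no outgoing triples all receive the same label.

```python
import hashlib


def is_bnode(term):
    # Assumption: blank nodes are written as strings starting with "_:".
    return isinstance(term, str) and term.startswith("_:")


def label_bnodes(graph):
    """Give each blank node a content-derived label, children before parents."""
    outgoing = {}
    for s, p, o in graph:
        if is_bnode(s):
            outgoing.setdefault(s, []).append((p, o))
        if is_bnode(o):
            outgoing.setdefault(o, [])
    labels = {}

    def label(node, visiting=()):
        if node in labels:
            return labels[node]
        if node in visiting:
            # The bottom-up approach only works without blank node cycles.
            raise ValueError("blank node cycle at " + node)
        parts = sorted(
            (p, label(o, visiting + (node,)) if is_bnode(o) else o)
            for p, o in outgoing[node]
        )
        digest = hashlib.sha256(repr(parts).encode("utf-8")).hexdigest()[:16]
        labels[node] = "urn:example:bnode:" + digest  # hypothetical scheme
        return labels[node]

    for node in outgoing:
        label(node)
    return labels


site_a = {(":office", ":address", "_:x"),
          ("_:x", ":postalCode", '"85003"'),
          ("_:x", ":addressRegion", '"AZ"')}
site_b = {(":assessor", ":address", "_:y"),
          ("_:y", ":addressRegion", '"AZ"'),
          ("_:y", ":postalCode", '"85003"')}

print(label_bnodes(site_a)["_:x"] == label_bnodes(site_b)["_:y"])
```

Because the label depends only on what hangs below the blank node, the same address published on two sites gets the same identifier without looking at the rest of either graph.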

Now I wonder whether I have hit on the motivation that led to Aidan's algorithm....

> 
> Thanks,
> David Booth
> 
>
Received on Wednesday, 28 November 2018 15:41:39 UTC