Re: Blank Nodes Re: Toward easier RDF: a proposal from David Booth on 2018-11-28 (semantic-web@w3.org from November 2018)

From: David Booth <david@dbooth.org>
Date: Tue, 27 Nov 2018 23:16:18 -0500
To: semantic-web@w3.org
Message-ID: <f585dadf-1a08-f328-f66f-8e06f26e4687@dbooth.org>
On 11/27/18 10:47 PM, Thomas Passin wrote:
> On 11/27/2018 10:01 PM, David Booth wrote:
>> On 11/27/18 2:04 PM, Nathan Rixham wrote:
>> . . .
>>> Here's an extract:
>>> {
>>>     ...
>>>    "name": "County Assessor's Office",
>>>    "address": {
>>>      "@type": "PostalAddress",
>>>      "streetAddress": "123 West Jefferson Street",
>>>      "addressLocality": "Phoenix",
>>>      "addressRegion": "AZ",
>>>      "postalCode": "85003",
>>>      "addressCountry": "US"
>>>    },
>>>    "geo": {
>>>      "@type": "GeoCoordinates",
>>>      "latitude": 33.4466,
>>>      "longitude": -112.07837
>>>    },
>>> }
>>> . . .
>>> [To] have the same address or geo coordinates published on tens of 
>>> thousands of different websites, all using a different ID (uri) would 
>>> be a huge, horrible, mess.
>>
>> Not so fast.  Two points:
>>
>>   - Unless you make a unique name assumption with URIs, that huge, 
>> horrible mess is pretty much the situation we already have using blank 
>> nodes.  Except that in some ways the current situation is *worse*, 
>> because the same data loaded twice causes duplicate triples (non-lean), 
>> whereas that would be automatically avoided if URIs were used.
> 
> But the key point here is that they might or might not be duplicates. 
> And the types and predicates (the semantics of table and column names, 
> if you get right down to it, since a lot of linked data comes from 
> relational databases) might or might not be the same.  There has to be 
> some way to get decent assurances that they *are* the same, before the 
> graphs get merged.  Tinkering with the RDF specs, and having ways to 
> canonically name blank nodes, won't handle this problem. It's a data and 
> semantics problem instead.

Perhaps we are talking about different things.  To my mind, if the above 
example appears in two different datasets, with the same @context so 
that the same triples are generated (except for blank node labels), then 
they *are* duplicates and they *do* mean the same thing.  And if the 
same predictable URI is generated instead of a blank node each time the 
JSON-LD uses curly braces {}, then to my mind that would be a *good* 
thing, because those bits of RDF, even though they come from different 
sources, *are* the same address and *do* mean the same thing.
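
To make the duplicate-triples point concrete, here is a minimal sketch in 
Python.  It is only an illustration: the fresh_bnode and content_uri 
helpers and the urn:example: scheme are my own assumptions, not part of 
JSON-LD or any RDF spec, and triples are modeled as plain tuples.

```python
import hashlib
import itertools

_counter = itertools.count()

def fresh_bnode(_props):
    """Mint a new blank node label, as a parser would on each load."""
    return f"_:b{next(_counter)}"

def content_uri(props):
    """Hypothetical scheme: derive a URI from the node's sorted properties."""
    canonical = "|".join(f"{p}={o}" for p, o in sorted(props.items()))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return f"urn:example:node:{digest}"

def to_triples(props, make_subject):
    """Turn one flat JSON object into a set of (s, p, o) tuples."""
    subject = make_subject(props)
    return {(subject, p, o) for p, o in props.items()}

address = {
    "@type": "PostalAddress",
    "streetAddress": "123 West Jefferson Street",
    "addressLocality": "Phoenix",
    "addressRegion": "AZ",
    "postalCode": "85003",
    "addressCountry": "US",
}

# Two sources publish the same address; each load mints a fresh blank node,
# so the merged set keeps both copies.
with_bnodes = to_triples(address, fresh_bnode) | to_triples(address, fresh_bnode)

# Same two sources, but the subject is derived from the content,
# so the set union collapses the copies.
with_uris = to_triples(address, content_uri) | to_triples(address, content_uri)

print(len(with_bnodes))  # 12
print(len(with_uris))    # 6
```

With blank nodes the merged graph carries two copies of the six triples; 
with a content-derived URI the second copy disappears on merge.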

At least, that's how I look at it.  Please explain further if I've 
misunderstood your point.

 > On top of that, many of these data graphs that one wants
 > to merge won't be either isomorphic to each other, or be
 > subsets or supersets.  In that situation, I don't see how
 > a blank node identifying algorithm that has to traverse
 > and consider the whole graph can spit out identifiers that
 > will make corresponding blank nodes in the various graphs
 > reliably have the same identifier.  That's the kind of
 > algorithm that Aidan Hogan's papers talk about, isn't it,
 > ones that consider the entire graph?

It has to consider more of the graph *if* blank node cycles are 
permitted, and that is what Aidan's algorithm does.  But if blank node 
cycles are not permitted, such as by prohibiting explicit blank nodes 
(but permitting implicit blank nodes generated by [] notation in 
Turtle), then the whole graph does *not* need to be considered.  Nodes 
can be efficiently and consistently labeled bottom-up if the graph is a 
tree with respect to blank node connections -- i.e., it has no blank 
node cycles.  The graph could still have cycles that involve URIs though 
-- those do not cause a problem.
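
Here is a rough Python sketch of what I mean by bottom-up labeling.  It 
is not Aidan's algorithm and not a proposed standard: the graph 
representation (a dict from blank node label to its outgoing 
predicate/object pairs), the label_bnodes name, and the urn:example: 
prefix are all assumptions made for illustration, and the string 
concatenation used for hashing is deliberately naive.

```python
import hashlib

def label_bnodes(graph):
    """Assign each blank node a label derived from its own subtree.

    graph maps a blank node label to a list of (predicate, object) pairs.
    An object that is itself a key of graph is treated as a blank node;
    anything else is treated as a URI or literal and used as-is.
    Assumes there are no blank node cycles; raises ValueError otherwise.
    """
    labels = {}
    in_progress = set()

    def label(node):
        if node in labels:
            return labels[node]
        if node in in_progress:
            raise ValueError(f"blank node cycle through {node}")
        in_progress.add(node)
        parts = []
        for predicate, obj in graph[node]:
            # Children are labeled first, so the parent's hash depends only
            # on the content below it, never on an incoming bnode label.
            # URIs are not followed, so cycles through URIs are harmless.
            value = label(obj) if obj in graph else obj
            parts.append(f"{predicate} {value}")
        in_progress.discard(node)
        canonical = "\n".join(sorted(parts))
        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
        labels[node] = f"urn:example:bnode:{digest}"
        return labels[node]

    for node in graph:
        label(node)
    return labels

# The same data from two sources, with different blank node labels
# and a different statement order.
source_a = {
    "_:b0": [("name", "County Assessor's Office"), ("geo", "_:b1")],
    "_:b1": [("latitude", "33.4466"), ("longitude", "-112.07837")],
}
source_b = {
    "_:x": [("latitude", "33.4466"), ("longitude", "-112.07837")],
    "_:y": [("geo", "_:x"), ("name", "County Assessor's Office")],
}

labels_a = label_bnodes(source_a)
labels_b = label_bnodes(source_b)
print(labels_a["_:b0"] == labels_b["_:y"])  # True
print(labels_a["_:b1"] == labels_b["_:x"])  # True
```

Each node is visited once and only its own subtree is examined, so 
neither source needs to see the other's graph to arrive at the same 
labels.  One consequence of this sketch is that two blank nodes with 
identical content receive the same label.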

Thanks,
David Booth
Received on Wednesday, 28 November 2018 04:16:41 UTC