Re: Blank Nodes Re: Toward easier RDF: a proposal from Gregg Kellogg on 2018-11-28 (semantic-web@w3.org from November 2018)

From: Gregg Kellogg <gregg@greggkellogg.net>
Date: Wed, 28 Nov 2018 11:08:42 -0800
To: David Booth <david@dbooth.org>
Cc: semantic-web@w3.org
Message-Id: <C4E4D661-2DA7-43C5-8848-B29645B8005E@greggkellogg.net>
> On Nov 27, 2018, at 8:16 PM, David Booth <david@dbooth.org> wrote:
> 
> On 11/27/18 10:47 PM, Thomas Passin wrote:
>> On 11/27/2018 10:01 PM, David Booth wrote:
>>> On 11/27/18 2:04 PM, Nathan Rixham wrote:
>>> . . .
>>>> Here's an extract:
>>>> {
>>>>     ...
>>>>    "name": "County Assessor's Office",
>>>>    "address": {
>>>>      "@type": "PostalAddress",
>>>>      "streetAddress": "123 West Jefferson Street",
>>>>      "addressLocality": "Phoenix",
>>>>      "addressRegion": "AZ",
>>>>      "postalCode": "85003",
>>>>      "addressCountry": "US"
>>>>    },
>>>>    "geo": {
>>>>      "@type": "GeoCoordinates",
>>>>      "latitude": 33.4466,
>>>>      "longitude": -112.07837
>>>>    }
>>>> }
>>>> . . .
>>>> [To] have the same address or geo coordinates published on tens of thousands of different websites, all using a different ID (uri) would be a huge, horrible, mess.
>>> 
>>> Not so fast.  Two points:
>>> 
>>>   - Unless you make a unique name assumption with URIs, that huge, horrible mess is pretty much the situation we already have using blank nodes.  Except that in some ways the current situation is *worse*, because the same data loaded twice causes duplicate triples (non-lean), whereas that would be automatically avoided if URIs were used.
>> But the key point here is that they might or might not be duplicates. And the types and predicates (the semantics of table and column names, if you get right down to it, since a lot of linked data comes from relational databases) might or might not be the same.  There has to be some way to get decent assurances that they *are* the same, before the graphs get merged.  Tinkering with the RDF specs, and having ways to canonically name blank nodes, won't handle this problem. It's a data and semantics problem instead.
> 
> Perhaps we are talking about different things.  To my mind, if the above example appears in two different datasets, with the same @context so that the same triples are generated (except for blank node labels), then they *are* duplicates and they *do* mean the same thing.  And if the same predictable URI is generated instead of a blank node each time the JSON-LD uses curly braces {}, then to my mind that would be a *good* thing, because those bits of RDF, even though they come from different sources, *are* the same address and *do* mean the same thing.

At times, I’ve considered that document-relative URIs would be a good alternative to BNodes in such cases, but this complicates things if the same node shows up in different documents. Skolemization is, of course, an option, but it is limited to the data source and, in my experience, sees little actual use.

I understand the interpretation of BNodes as existential quantifiers, but this does not help where they also serve as identifiers for specific vertices in a graph that are neither labeled with URIs nor literals. As graph merge semantics show, it’s perfectly fine for there to be several different vertices with the same incoming and outgoing edges, and an existential quantifier can match any of them. The resulting graphs are not isomorphic, however, because the nodes (vertices) in one graph may outnumber those in another.

The problem comes down to the dual nature of BNodes as existential quantifiers and as identifiers for concrete but unlabeled nodes within the serialization of a particular graph, and to how they are treated when merging graphs.
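
As a rough illustration of the merge behaviour, here is a minimal Python sketch. It models a graph as a set of (subject, predicate, object) string tuples and treats anything starting with "_:" as a blank node. Those conventions, the shortened predicate names, the example.org URIs, and the helper names (is_bnode, standardize_apart, merge) are assumptions made for the example, not part of any RDF library.

```python
def is_bnode(term):
    """Treat any term starting with '_:' as a blank node label (toy convention)."""
    return term.startswith("_:")

def standardize_apart(triples, tag):
    """Rename blank nodes so labels from different sources cannot clash."""
    def rename(term):
        return f"_:{tag}_{term[2:]}" if is_bnode(term) else term
    return {(rename(s), p, rename(o)) for s, p, o in triples}

def merge(*graphs):
    """Merge graphs, keeping each source's blank nodes distinct."""
    merged = set()
    for i, graph in enumerate(graphs):
        merged |= standardize_apart(graph, f"g{i}")
    return merged

# The same address description, as it might be loaded from two documents.
with_bnode = {
    ("http://example.org/office", "address", "_:b0"),
    ("_:b0", "streetAddress", "123 West Jefferson Street"),
    ("_:b0", "postalCode", "85003"),
}
# Hypothetical variant in which the address node carries a URI instead.
ADDR = "http://example.org/address/1"
with_uri = {(ADDR if is_bnode(s) else s, p, ADDR if is_bnode(o) else o)
            for s, p, o in with_bnode}

bnode_merge = merge(with_bnode, with_bnode)
uri_merge = merge(with_uri, with_uri)
bnodes = {t for s, _, o in bnode_merge for t in (s, o) if is_bnode(t)}

print(len(bnode_merge), len(bnodes))  # 6 triples, 2 distinct blank nodes
print(len(uri_merge))                 # 3 triples
```

The merged BNode graph still entails no more than a single copy does, but it has twice as many vertices, which is the mismatch between the two readings of a BNode.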

Gregg

> At least, that's how I look at it.  Please explain further if I've misunderstood your point.
> 
> > On top of that, many of these data graphs that one wants
> > to merge won't be either isomorphic to each other, or be
> > subsets or supersets.  In that situation, I don't see how
> > a blank node identifying algorithm that has to traverse
> > and consider the whole graph can spit out identifiers that
> > will make corresponding blank nodes in the various graphs
> > reliably have the same identifier.  That's the kind of
> > algorithm that Aidan Hogan's papers talk about, isn't it,
> > ones that consider the entire graph?
> 
> It has to consider more of the graph *if* blank node cycles are permitted, and that is what Aidan's algorithm does.  But if blank node cycles are not permitted, such as by prohibiting explicit blank nodes (but permitting implicit blank nodes generated by [] notation in Turtle), then the whole graph does *not* need to be considered.  Nodes can be efficiently and consistently labeled bottom-up if the graph is a tree with respect to blank node connections -- i.e., it has no blank node cycles.  The graph could still have cycles that involve URIs though -- those do not cause a problem.
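
The bottom-up labeling described above could be sketched roughly as follows. This is only one possible reading of the idea, not the algorithm from Aidan Hogan's papers: each blank node gets a label hashed from its sorted outgoing (predicate, object) pairs, with blank node objects labeled first. The tuple-based graph model, the "_:" convention, the urn:example:bnode: prefix, and the function names are all assumptions made for the example, and the toy model does not distinguish URIs from literals.

```python
import hashlib

def is_bnode(term):
    """Treat any term starting with '_:' as a blank node label (toy convention)."""
    return term.startswith("_:")

def label_bnodes(triples):
    """Map each blank node to a label derived bottom-up from its outgoing edges."""
    out = {}
    for s, p, o in triples:
        if is_bnode(s):
            out.setdefault(s, []).append((p, o))
        if is_bnode(o):
            out.setdefault(o, [])
    labels = {}
    in_progress = set()

    def label(node):
        if node in labels:
            return labels[node]
        if node in in_progress:
            # The bottom-up approach only works without blank node cycles.
            raise ValueError(f"blank node cycle through {node}")
        in_progress.add(node)
        parts = sorted((p, label(o) if is_bnode(o) else o) for p, o in out[node])
        in_progress.discard(node)
        digest = hashlib.sha256(repr(parts).encode("utf-8")).hexdigest()[:16]
        labels[node] = f"urn:example:bnode:{digest}"
        return labels[node]

    for node in out:
        label(node)
    return labels

def relabel(triples):
    """Replace every blank node with its content-derived label."""
    labels = label_bnodes(triples)
    return {(labels.get(s, s), p, labels.get(o, o)) for s, p, o in triples}

# Two documents with the same content but different blank node labels.
doc_a = [
    ("http://example.org/office", "address", "_:a"),
    ("_:a", "postalCode", "85003"),
    ("_:a", "geo", "_:g"),
    ("_:g", "latitude", "33.4466"),
]
doc_b = [
    ("http://example.org/office", "address", "_:x"),
    ("_:x", "postalCode", "85003"),
    ("_:x", "geo", "_:y"),
    ("_:y", "latitude", "33.4466"),
]

combined = relabel(doc_a) | relabel(doc_b)
print(len(combined))  # 4: the two documents collapse into one set of triples
```

Because each label depends only on the subtree below a node, two documents that describe the same address produce the same labels without looking at the rest of either graph. Nodes with identical outgoing edges also collapse into one, which is the intended effect here but does change the vertex count.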
> 
> Thanks,
> David Booth
> 
>
Received on Wednesday, 28 November 2018 19:09:08 UTC