Re: Blank Nodes Re: Toward easier RDF: a proposal from Nathan Rixham on 2018-11-28 (semantic-web@w3.org from November 2018)

From: Nathan Rixham <nathan@webr3.org>
Date: Wed, 28 Nov 2018 16:14:37 +0000
To: Henry Story <henry.story@bblfish.net>
Cc: David Booth <david@dbooth.org>, W3C Semantic Web IG <semantic-web@w3.org>
Message-ID: <CANiy74wbt9n4g9VkoDb47LH7gB3sziU4hQjFZDcx6KokuVQYdg@mail.gmail.com>
On Wed, Nov 28, 2018 at 3:46 PM Henry Story <henry.story@bblfish.net> wrote:

> > On 28 Nov 2018, at 05:16, David Booth <david@dbooth.org> wrote:
> >
> > On 11/27/18 10:47 PM, Thomas Passin wrote:
> >> On 11/27/2018 10:01 PM, David Booth wrote:
> >>> On 11/27/18 2:04 PM, Nathan Rixham wrote:
> >>> . . .
> >>>> Here's an extract:
> >>>> {
> >>>>     ...
> >>>>    "name": "County Assessor's Office",
> >>>>    "address": {
> >>>>      "@type": "PostalAddress",
> >>>>      "streetAddress": "123 West Jefferson Street",
> >>>>      "addressLocality": "Phoenix",
> >>>>      "addressRegion": "AZ",
> >>>>      "postalCode": "85003",
> >>>>      "addressCountry": "US"
> >>>>    },
> >>>>    "geo": {
> >>>>      "@type": "GeoCoordinates",
> >>>>      "latitude": 33.4466,
> >>>>      "longitude": -112.07837  },
> >>>> }
> >>>> . . .
> >>>> [To] have the same address or geo coordinates published on tens of
> thousands of different websites, all using a different ID (uri) would be a
> huge, horrible, mess.
> >>>
> >>> Not so fast.  Two points:
> >>>
> >>>   - Unless you make a unique name assumption with URIs, that huge,
> horrible mess is pretty much the situation we already have using blank
> nodes.  Except that in some ways the current situation is *worse*, because
> the same data loaded twice causes duplicate triples (non-lean), whereas that
> would be automatically avoided if URIs were used.
> >> But the key point here is that they might or might not be duplicates.
> And the types and predicates (the semantics of table and column names, if
> you get right down to it, since a lot of linked data comes from relational
> databases) might or might not be the same.  There has to be some way to get
> decent assurances that they *are* the same, before the graphs get merged.
> Tinkering with the RDF specs, and having ways to canonically name blank
> nodes, won't handle this problem. It's a data and semantics problem instead.
>
> That is mostly because there are no globally clear ways of identifying two
> identical addresses. That could indeed
> be solved by having a URN scheme globally understood for addresses that
> would be easy to coin - a kind of more
> powerful postcode.
>
> A person looking at the JSON sees the same address because they think of
> a number of things:
>   1. that "address" is a functional property.
>     That is usually what OO programmers tend to think since all their
> attributes are unique. If they then want more than one address they need to
> move to having a relation to a set. Then they have the same problem since
> they would need to work out when two addresses are the same.
>   2.  they could decide that two addresses are the same if they have
> exactly the same attributes and values.
>
>
>  I wonder if in that case 2. there is not a recipe for when one can merge
> such blank nodes. I don't think that semantically
> the following two graphs G1 and G2 below are different:
>
> G1 =  {
> :joe :address [ a "PostalAddress";
>      :streetAddress "123 West Jefferson Street";
>      :addressLocality "Phoenix";
>      :addressRegion "AZ";
>      :postalCode "85003";
>      :addressCountry "US";
>    ];
>    :address [ a "PostalAddress";
>      :streetAddress "123 West Jefferson Street";
>      :addressLocality "Phoenix";
>      :addressRegion "AZ";
>      :postalCode "85003";
>      :addressCountry "US";
>    ].
> }
>
>
> G2 = {
> :joe :address [ a "PostalAddress";
>      :streetAddress "123 West Jefferson Street";
>      :addressLocality "Phoenix";
>      :addressRegion  "AZ";
>      :postalCode "85003";
>      :addressCountry "US";
>    ].
> }
>
> That would be satisfied in all the same models, it seems to me. No?
>

Yes, saying some address with properties x,y,z exists twice is the same as
saying it exists once.
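
As a rough illustration of that equivalence, here is a minimal Python sketch. It assumes triples are plain tuples of strings, that blank nodes are labelled with a "_:" prefix, and that blank nodes only link to non-blank terms. The helper names (is_bnode, merge_redundant_bnodes, address_triples) are hypothetical, and this is not a general leaning algorithm. It only collapses blank nodes whose incoming and outgoing triples are identical, which is the G1 case above.

```python
from collections import defaultdict

def is_bnode(term):
    # Assumption of this sketch: blank nodes are strings starting with "_:".
    return isinstance(term, str) and term.startswith("_:")

def merge_redundant_bnodes(triples):
    """Collapse blank nodes whose incoming and outgoing triples are identical."""
    signature = defaultdict(set)
    for s, p, o in triples:
        if is_bnode(s) and is_bnode(o):
            raise ValueError("sketch does not handle bnode-to-bnode links")
        if is_bnode(s):
            signature[s].add(("out", p, o))
        if is_bnode(o):
            signature[o].add(("in", s, p))
    representative = {}  # signature -> first blank node seen with it
    rename = {}          # blank node -> blank node that replaces it
    for node in sorted(signature):
        key = frozenset(signature[node])
        rename[node] = representative.setdefault(key, node)
    return {(rename.get(s, s), p, rename.get(o, o)) for s, p, o in triples}

ADDRESS = {
    "rdf:type": "PostalAddress",
    ":streetAddress": "123 West Jefferson Street",
    ":addressLocality": "Phoenix",
    ":addressRegion": "AZ",
    ":postalCode": "85003",
    ":addressCountry": "US",
}

def address_triples(bnode):
    # One :joe :address link plus the six property triples of the address.
    yield (":joe", ":address", bnode)
    for p, o in ADDRESS.items():
        yield (bnode, p, o)

g1 = set(address_triples("_:b1")) | set(address_triples("_:b2"))
g2 = set(address_triples("_:b1"))
assert merge_redundant_bnodes(g1) == g2
```

Incoming links are part of the signature on purpose: two blank nodes with the same properties but attached to different subjects are left alone, since merging them would assert something the original graph does not.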

You're correct that there is no globally clear way of identifying two
identical addresses, or of accounting for all the variations in how they are
written, just as the precision of geo coordinates can vary yet still point to
the "same" physical location.

I feel like language has a great deal to do with this. If we referred to
this address thing as an unidentified object, and looked in our databases,
documents, code, and APIs, we'd find that a huge portion of them consist of
these unidentified objects, where the set of property-value pairs is their
identity, an identity that's good enough for purpose.

Under this unidentified-object scenario, considering identifiers for
unidentified things seems like a strange question, as the whole point is
that they are unidentified.

Realistically, saying we require everything to have a name/identifier/URI
is just a no-go. Immediate real-world first responses would be (a) invalid
RDF, as the IDs would be omitted, or (b) encoding of objects as string
values, as in a chunk of JSON or an XML fragment in a string property.

Now, IMHO there's merit in generating IDs for bnodes, but behind the
interface, not over the wire: for use in canonicalization, storage engines, or
code, *not* in a serialized document sent between parties. To say these
must or should be in a Turtle document or JSON-LD document, or that every
system that outputs structured data now has to implement a way to make the
unidentified identifiable in a reproducible way... because? Why?
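
For what "behind the interface" could look like, here is a minimal Python sketch that derives a deterministic local ID from an object's property-value pairs. It assumes a flat object with string keys and string values. The function name internal_bnode_id and the "_:b" label format are hypothetical, and this is a plain hash over sorted pairs, not a standard canonicalization algorithm.

```python
import hashlib

def internal_bnode_id(properties):
    """Derive a deterministic local ID from a flat property/value mapping."""
    h = hashlib.sha256()
    # Sort so the ID does not depend on the order the properties arrive in.
    for key, value in sorted(properties.items()):
        for part in (key, value):
            data = part.encode("utf-8")
            # Length-prefix each part so ("ab", "c") and ("a", "bc") differ.
            h.update(len(data).to_bytes(8, "big"))
            h.update(data)
    return "_:b" + h.hexdigest()[:16]

address = {
    "@type": "PostalAddress",
    "streetAddress": "123 West Jefferson Street",
    "addressLocality": "Phoenix",
    "addressRegion": "AZ",
    "postalCode": "85003",
    "addressCountry": "US",
}

# The same pairs in a different order map to the same local ID.
reordered = dict(reversed(list(address.items())))
assert internal_bnode_id(address) == internal_bnode_id(reordered)
```

The ID is only meaningful inside the store or codebase that computed it. Nothing here requires it to appear in the document that goes over the wire.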
Received on Wednesday, 28 November 2018 16:15:10 UTC