Blank Nodes as Graph Identifiers are NOT required for Normalization from Gavin Carothers on 2013-02-26 (public-rdf-wg@w3.org from February 2013)

From: Gavin Carothers <gavin@carothers.name>
Date: Mon, 25 Feb 2013 18:24:10 -0800
To: RDF-WG WG <public-rdf-wg@w3.org>
Message-ID: <CAPqY83xbmJA5HD4OwHNWr-HobmEQnhK79irPZ4pChr1u9+HDNA@mail.gmail.com>

{
  "@context": ...,
  "@graph": [
    {
      "@graph": {
        "name": "Joe"
      }
    },
    {
      "@graph": {
        "name": "Susan"
      }
    }
  ]
}

These are two graphs, and we need to create unique names for them for a
normalization algorithm.

Graph 1 expressed as Turtle:

@prefix : <http://example.com/ns/> .

[] :name "Joe" .

Graph 2 expressed as Turtle:

@prefix : <http://example.com/ns/> .

[] :name "Susan" .

So far so good. The normalization is in terms of N-Quads, however, and
therefore needs both names for the graphs and labels for the blank nodes.
Let's start by putting each graph into N-Triples.

Graph 1 expressed as N-Triples:

_:c14n1 <http://example.com/ns/name> "Joe" .

Graph 2 expressed as N-Triples:

_:c14n1 <http://example.com/ns/name> "Susan" .

Ah ha, we've used the same blank node label for both! But that's okay at
the moment, since both exist as graphs in their own right. Let's take the
md5sum of both:

Graph 1 md5sum:

12e775c37a0e6a327ace2114bb5a1b47

Graph 2 md5sum:

a44173cbf95beeee164add48a1201b24

Now, let's create that N-Quads document:

_:c14n-12e775c37a0e6a327ace2114bb5a1b47-1 <http://example.com/ns/name> "Joe" <urn:hash:application/n-triples:md5:12e775c37a0e6a327ace2114bb5a1b47> .
_:c14n-a44173cbf95beeee164add48a1201b24-1 <http://example.com/ns/name> "Susan" <urn:hash:application/n-triples:md5:a44173cbf95beeee164add48a1201b24> .

So why exactly don't hashes work for identifying graphs that ALREADY have
to be normalized? If we're very worried about collisions (we shouldn't be;
a better cryptographic method is assumed to be in use to really sign these
documents in your use case), replace md5 with sha512 or Whirlpool.
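
The steps above can be sketched in a few lines of Python. This is only an
illustration, not a normalization algorithm: it assumes each graph has
already been normalized into N-Triples with labels of the form _:c14nN, the
function names (name_graph, to_nquads) are made up for this example, and
the regex-based relabeling is naive (it would also rewrite a matching
string inside a literal). The exact bytes that get hashed (trailing
newline, encoding) are an assumption too, so the digests need not match the
ones quoted above.

```python
import hashlib
import re


def name_graph(ntriples: str, algorithm: str = "md5") -> tuple[str, str]:
    """Return (graph IRI, hex digest) for one normalized N-Triples graph.

    Hypothetical helper: the IRI layout copies the urn:hash:... form used
    in this message; the hashed bytes are simply the UTF-8 text as given.
    """
    digest = hashlib.new(algorithm, ntriples.encode("utf-8")).hexdigest()
    return f"<urn:hash:application/n-triples:{algorithm}:{digest}>", digest


def to_nquads(graphs: list[str], algorithm: str = "md5") -> str:
    """Combine normalized N-Triples graphs into one N-Quads document."""
    quads = []
    for ntriples in graphs:
        iri, digest = name_graph(ntriples, algorithm)
        for line in ntriples.splitlines():
            line = line.strip()
            if not line:
                continue
            if not line.endswith("."):
                raise ValueError(f"not an N-Triples statement: {line!r}")
            # Scope the per-graph label: _:c14n1 -> _:c14n-<digest>-1
            line = re.sub(r"_:c14n(\d+)", rf"_:c14n-{digest}-\1", line)
            # Replace the final "." with "<graph IRI> ."
            quads.append(f"{line[:-1].rstrip()} {iri} .")
    return "\n".join(sorted(quads)) + "\n"


if __name__ == "__main__":
    joe = '_:c14n1 <http://example.com/ns/name> "Joe" .\n'
    susan = '_:c14n1 <http://example.com/ns/name> "Susan" .\n'
    print(to_nquads([joe, susan]), end="")
    # Swapping the hash is a one-argument change:
    print(to_nquads([joe, susan], algorithm="sha512"), end="")
```

Because the graph name and the blank node labels are both derived from the
graph's own content, the same label _:c14n1 in two different graphs ends up
as two distinct labels in the combined document.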

The argument I see is: what if blank nodes are shared between graphs in a
dataset? That seems to be a "mere" matter of designing the normalization
method to be stable at some point and then defining all the labels you
need. Changing from labels to hash-based IRIs at the last moment shouldn't
make it any harder. (Which is not the same thing as saying it's easy.)
Processing using blank nodes INTERNALLY, meanwhile, makes perfect sense.
Many reasoners and other software use blank nodes internally in places
that RDF doesn't allow.

This is NOT a generic argument for how to handle unlabeled graphs in
JSON-LD, which I still think are a poor idea that can't be expressed in any
other graph syntax, nor used with SPARQL. Just saying that blank nodes as
graph labels are NOT required for the normalization use case.

Cheers,
Gavin

Received on Tuesday, 26 February 2013 02:24:38 UTC