Re: RDF canonicalization [was Re: deterministic naming of blank nodes] from Gregg Kellogg on 2015-05-14 (public-json-ld@w3.org from May 2015)

From: Gregg Kellogg <gregg@greggkellogg.com>
Date: Thu, 14 May 2015 13:31:10 -0700
To: David Booth <david@dbooth.org>
Cc: "public-json-ld@w3.org" <public-json-ld@w3.org>, Gregg Kellogg <gregg@greggkellogg.net>, Aidan Hogan <ahogan@dcc.uchile.cl>
Message-ID: <1269459178310301314@unknownmsgid>
On May 14, 2015, at 12:53 PM, David Booth <david@dbooth.org> wrote:
>
> FYI.  I did not realize that work was active on JSON-LD
> canonicalization, but I see the draft document was updated April 8 2015.
>  I'm glad to see it is!

Yes, it's required for RDF Signatures used for payments and credentials.

> It would be nice if the JSON-LD canonicalization algorithm were aligned
> with an N-Triples / N-Quads canonicalization algorithm, and if both
> could ensure blank node stability in the face of small graph changes.
> would this be feasible?

Even though it's in the JSON-LD repo, it is an RDF Dataset algorithm,
that doesn't depend on JSON-LD, except that the test inputs are
serialized this way. The results are normalized N-Triples/Quads.

Gregg

> David Booth
>
> -------- Forwarded Message --------
> Subject: Re: RDF canonicalization [was Re: deterministic naming of blank
> nodes]
> Date: Thu, 14 May 2015 15:44:04 -0400
> From: David Booth <david@dbooth.org>
> To: Gregg Kellogg <gregg@greggkellogg.net>
> CC: Arto Bendiken <arto@bendiken.net>, W3C Semantic Web IG
> <semantic-web@w3.org>
>
> On 05/14/2015 02:50 PM, Gregg Kellogg wrote:
>>> On May 14, 2015, at 10:27 AM, David Booth <david@dbooth.org>
>>> wrote:
>>>
>>>> On 05/14/2015 01:03 PM, Arto Bendiken wrote:
>>>> On Thu, May 14, 2015 at 6:20 AM, David Booth <david@dbooth.org>
>>>> wrote:
>>>>> On 05/13/2015 02:14 AM, ahogan@dcc.uchile.cl wrote:
>>>>>>
>>>>>> Aidan Hogan. "Skolemising Blank Nodes while Preserving
>>>>>> Isomorphism". In WWW, Florence, Italy, May 18–22, 2015 (to
>>>>>> appear). Available from:
>>>>>> http://aidanhogan.com/docs/skolems_blank_nodes_www.pdf
>>>>>>
>>>>>> . . .  the paper presents an algorithm for deterministically
>>>>>> labelling blank nodes of any RDF graph such that the labels
>>>>>> of two isomorphic RDF graphs will "correspond".
>>>>>
>>>>>
>>>>> Excellent!  It would be really good if we could define a W3C
>>>>> standard N-Triples and N-Quads canonicalization, for a variety
>>>>> of purposes.
>>>>>
>>>>> But I have a question about your algorithm.  For greatest
>>>>> utility, a canonicalization algorithm should be robust against
>>>>> small changes in a graph.  In other words, if graphs G and H
>>>>> are very similar then their canonicalizations should also be
>>>>> very similar.  (By "very similar" I mean that G and H have
>>>>> large subgraphs G' and H' that are isomorphic.)
>>>>>
>>>>> How robust is your algorithm to small changes in a graph?  Do
>>>>> you have any data on how much the canonicalization changes when
>>>>> a graph is changed?
>>>>
>>>> Still reading the paper, but given the use of cryptographic
>>>> hashing (p. 9), any minor change will result in a cascade effect
>>>> that will produce a wholly different output.
>>>>
>>>> Further, this is precisely what one wants if the
>>>> canonicalization algorithm is to be used for cryptographic
>>>> applications such as graph signing, as per e.g. the Credentials
>>>> CG. Similarity would be a different problem that entails
>>>> conflicting requirements.
>>>
>>> For the graph as a whole, yes, one would certainly want the hash of
>>> the graph to change if any part of the graph changes.  But I was
>>> talking about the algorithm's re-labeling of the blank nodes within
>>> the graph. For graph comparison purposes, it would be very
>>> beneficial if the algorithm would tend to generate the same blank
>>> node labels even when a graph changes slightly.  Are you saying
>>> that there is also value, for cryptographic hashing purposes, in
>>> the algorithm *changing* blank node labels if any part of the graph
>>> changes?  If so, why?  Can you explain?
>>
>> If the canonicalized BNode IDs are based on a hash of the dependent
>> statements, then the same ID should be generated for a graph having
>> more statements that do not relate to those BNodes. I would say that
>> the BNode IDs generated from the union of two disjoint graphs should
>> be the same as the BNode IDs generated from the union of those
>> graphs; this would make it useful for doing an RDF Diff.
>>
>> Note that the proposed RDF Normalization spec [1] does not generate
>> BNode IDs this way, and so would not have this desired purpose,
>> although I believe the algorithm could be changed to accommodate
>> this.
>>
>> The RDF Isomorphism algorithm used in the Ruby rdf-isomorphic gem,
>> based on Jeremy Carrol’s [2] generates BNode identifiers by creating
>> a hash of dependent nodes, and should have this property, for
>> disjoint sub-graphs, anyway.
>
> I think blank node stability is a critical property for a
> canonicalization algorithm to have (to the extent feasible), to enable
> graph diffs to be easily computed using standard diff tools.  It would
> be a huge lost opportunity if we standardized a canonicalization
> algorithm without this property.
>
> I'll forward this to the public-json-ld@w3.org list.
>
> Thanks!
> David Booth
>
>>
>> Gregg
>>
>> [1] http://json-ld.org/spec/latest/rdf-dataset-normalization/ [2]
>> http://www.hpl.hp.com/techreports/2001/HPL-2001-293.pdf
>>
>>> David Booth
>
>
Received on Thursday, 14 May 2015 20:31:39 UTC