Re: RDF canonicalization [was Re: deterministic naming of blank nodes]

> On May 14, 2015, at 10:27 AM, David Booth <david@dbooth.org> wrote:
> 
> On 05/14/2015 01:03 PM, Arto Bendiken wrote:
>> On Thu, May 14, 2015 at 6:20 AM, David Booth <david@dbooth.org> wrote:
>>> On 05/13/2015 02:14 AM, ahogan@dcc.uchile.cl wrote:
>>>> 
>>>> Aidan Hogan. "Skolemising Blank Nodes while Preserving Isomorphism". In
>>>> WWW, Florence, Italy, May 18–22, 2015 (to appear).
>>>> Available from: http://aidanhogan.com/docs/skolems_blank_nodes_www.pdf
>>>> 
>>>> . . .  the paper presents an algorithm for deterministically
>>>> labelling blank nodes of any RDF graph such that the labels of two
>>>> isomorphic RDF graphs will "correspond".
>>> 
>>> 
>>> Excellent!  It would be really good if we could define a W3C standard
>>> N-Triples and N-Quads canonicalization, for a variety of purposes.
>>> 
>>> But I have a question about your algorithm.  For greatest utility, a
>>> canonicalization algorithm should be robust against small changes in a
>>> graph.  In other words, if graphs G and H are very similar then their
>>> canonicalizations should also be very similar.  (By "very similar" I mean
>>> that G and H have large subgraphs G' and H' that are isomorphic.)
>>> 
>>> How robust is your algorithm to small changes in a graph?  Do you have any
>>> data on how much the canonicalization changes when a graph is changed?
>> 
>> Still reading the paper, but given the use of cryptographic hashing
>> (p. 9), any minor change will result in a cascade effect that will
>> produce a wholly different output.
>> 
>> Further, this is precisely what one wants if the canonicalization
>> algorithm is to be used for cryptographic applications such as graph
>> signing, as per e.g. the Credentials CG. Similarity would be a
>> different problem that entails conflicting requirements.
> 
> For the graph as a whole, yes, one would certainly want the hash of the graph to change if any part of the graph changes.  But I was talking about the algorithm's re-labeling of the blank nodes within the graph. For graph comparison purposes, it would be very beneficial if the algorithm would tend to generate the same blank node labels even when a graph changes slightly.  Are you saying that there is also value, for cryptographic hashing purposes, in the algorithm *changing* blank node labels if any part of the graph changes?  If so, why?  Can you explain?

If the canonicalized BNode IDs are based on a hash of the dependent statements, then the same ID should be generated for a graph having more statements that do not relate to those BNodes. I would say that the BNode IDs generated from the union of two disjoint graphs should be the same as the BNode IDs generated from the union of those graphs; this would make it useful for doing an RDF Diff.

Note that the proposed RDF Normalization spec [1] does not generate BNode IDs this way, and so would not have this desired purpose, although I believe the algorithm could be changed to accommodate this.

The RDF Isomorphism algorithm used in the Ruby rdf-isomorphic gem, based on Jeremy Carrol’s [2] generates BNode identifiers by creating a hash of dependent nodes, and should have this property, for disjoint sub-graphs, anyway.

Gregg

[1] http://json-ld.org/spec/latest/rdf-dataset-normalization/
[2] http://www.hpl.hp.com/techreports/2001/HPL-2001-293.pdf

> David Booth
> 

Received on Thursday, 14 May 2015 18:50:54 UTC