Re: RDF canonicalization [was Re: deterministic naming of blank nodes]

On 05/14/2015 02:50 PM, Gregg Kellogg wrote:
>> On May 14, 2015, at 10:27 AM, David Booth <>
>> wrote:
>> On 05/14/2015 01:03 PM, Arto Bendiken wrote:
>>> On Thu, May 14, 2015 at 6:20 AM, David Booth <>
>>> wrote:
>>>> On 05/13/2015 02:14 AM, wrote:
>>>>> Aidan Hogan. "Skolemising Blank Nodes while Preserving
>>>>> Isomorphism". In WWW, Florence, Italy, May 18–22, 2015 (to
>>>>> appear). Available from:
>>>>> . . .  the paper presents an algorithm for deterministically
>>>>> labelling blank nodes of any RDF graph such that the labels
>>>>> of two isomorphic RDF graphs will "correspond".
>>>> Excellent!  It would be really good if we could define a W3C
>>>> standard N-Triples and N-Quads canonicalization, for a variety
>>>> of purposes.
>>>> But I have a question about your algorithm.  For greatest
>>>> utility, a canonicalization algorithm should be robust against
>>>> small changes in a graph.  In other words, if graphs G and H
>>>> are very similar then their canonicalizations should also be
>>>> very similar.  (By "very similar" I mean that G and H have
>>>> large subgraphs G' and H' that are isomorphic.)
>>>> How robust is your algorithm to small changes in a graph?  Do
>>>> you have any data on how much the canonicalization changes when
>>>> a graph is changed?
>>> Still reading the paper, but given the use of cryptographic
>>> hashing (p. 9), any minor change will result in a cascade effect
>>> that will produce a wholly different output.
>>> Further, this is precisely what one wants if the
>>> canonicalization algorithm is to be used for cryptographic
>>> applications such as graph signing, as per e.g. the Credentials
>>> CG. Similarity would be a different problem that entails
>>> conflicting requirements.
>> For the graph as a whole, yes, one would certainly want the hash of
>> the graph to change if any part of the graph changes.  But I was
>> talking about the algorithm's re-labeling of the blank nodes within
>> the graph. For graph comparison purposes, it would be very
>> beneficial if the algorithm would tend to generate the same blank
>> node labels even when a graph changes slightly.  Are you saying
>> that there is also value, for cryptographic hashing purposes, in
>> the algorithm *changing* blank node labels if any part of the graph
>> changes?  If so, why?  Can you explain?
> If the canonicalized BNode IDs are based on a hash of the dependent
> statements, then the same ID should be generated for a graph having
> more statements that do not relate to those BNodes. I would say that
> the BNode IDs generated from the union of two disjoint graphs should
> be the same as the BNode IDs generated from the union of those
> graphs; this would make it useful for doing an RDF Diff.
> Note that the proposed RDF Normalization spec [1] does not generate
> BNode IDs this way, and so would not have this desired purpose,
> although I believe the algorithm could be changed to accommodate
> this.
> The RDF Isomorphism algorithm used in the Ruby rdf-isomorphic gem,
> based on Jeremy Carrol’s [2] generates BNode identifiers by creating
> a hash of dependent nodes, and should have this property, for
> disjoint sub-graphs, anyway.

I think blank node stability is a critical property for a 
canonicalization algorithm to have (to the extent feasible), to enable 
graph diffs to be easily computed using standard diff tools.  It would 
be a huge lost opportunity if we standardized a canonicalization 
algorithm without this property.

I'll forward this to the list.

David Booth

> Gregg
> [1] [2]
>> David Booth

Received on Thursday, 14 May 2015 19:44:33 UTC