Re: RDF canonicalization [was Re: deterministic naming of blank nodes]

On 05/14/2015 01:03 PM, Arto Bendiken wrote:
> On Thu, May 14, 2015 at 6:20 AM, David Booth <david@dbooth.org> wrote:
>> On 05/13/2015 02:14 AM, ahogan@dcc.uchile.cl wrote:
>>>
>>> Aidan Hogan. "Skolemising Blank Nodes while Preserving Isomorphism". In
>>> WWW, Florence, Italy, May 18–22, 2015 (to appear).
>>> Available from: http://aidanhogan.com/docs/skolems_blank_nodes_www.pdf
>>>
>>> . . .  the paper presents an algorithm for deterministically
>>> labelling blank nodes of any RDF graph such that the labels of two
>>> isomorphic RDF graphs will "correspond".
>>
>>
>> Excellent!  It would be really good if we could define a W3C standard
>> N-Triples and N-Quads canonicalization, for a variety of purposes.
>>
>> But I have a question about your algorithm.  For greatest utility, a
>> canonicalization algorithm should be robust against small changes in a
>> graph.  In other words, if graphs G and H are very similar then their
>> canonicalizations should also be very similar.  (By "very similar" I mean
>> that G and H have large subgraphs G' and H' that are isomorphic.)
>>
>> How robust is your algorithm to small changes in a graph?  Do you have any
>> data on how much the canonicalization changes when a graph is changed?
>
> Still reading the paper, but given the use of cryptographic hashing
> (p. 9), any minor change will result in a cascade effect that will
> produce a wholly different output.
>
> Further, this is precisely what one wants if the canonicalization
> algorithm is to be used for cryptographic applications such as graph
> signing, as per e.g. the Credentials CG. Similarity would be a
> different problem that entails conflicting requirements.

For the graph as a whole, yes, one would certainly want the hash of the 
graph to change if any part of the graph changes.  But I was talking 
about the algorithm's re-labeling of the blank nodes within the graph. 
For graph comparison purposes, it would be very beneficial if the 
algorithm would tend to generate the same blank node labels even when a 
graph changes slightly.  Are you saying that there is also value, for 
cryptographic hashing purposes, in the algorithm *changing* blank node 
labels if any part of the graph changes?  If so, why?  Can you explain?

David Booth

Received on Thursday, 14 May 2015 17:27:38 UTC