Re: RDF canonicalization [was Re: deterministic naming of blank nodes]

On 05/14/2015 02:50 PM, Gregg Kellogg wrote:
>> On May 14, 2015, at 10:27 AM, David Booth <david@dbooth.org>
>> wrote:
>>
>> On 05/14/2015 01:03 PM, Arto Bendiken wrote:
>>> On Thu, May 14, 2015 at 6:20 AM, David Booth <david@dbooth.org>
>>> wrote:
>>>> On 05/13/2015 02:14 AM, ahogan@dcc.uchile.cl wrote:
>>>>>
>>>>> Aidan Hogan. "Skolemising Blank Nodes while Preserving
>>>>> Isomorphism". In WWW, Florence, Italy, May 18–22, 2015 (to
>>>>> appear). Available from:
>>>>> http://aidanhogan.com/docs/skolems_blank_nodes_www.pdf
>>>>>
>>>>> . . .  the paper presents an algorithm for deterministically
>>>>> labelling blank nodes of any RDF graph such that the labels
>>>>> of two isomorphic RDF graphs will "correspond".
>>>>
>>>>
>>>> Excellent!  It would be really good if we could define a W3C
>>>> standard N-Triples and N-Quads canonicalization, for a variety
>>>> of purposes.
>>>>
>>>> But I have a question about your algorithm.  For greatest
>>>> utility, a canonicalization algorithm should be robust against
>>>> small changes in a graph.  In other words, if graphs G and H
>>>> are very similar then their canonicalizations should also be
>>>> very similar.  (By "very similar" I mean that G and H have
>>>> large subgraphs G' and H' that are isomorphic.)
>>>>
>>>> How robust is your algorithm to small changes in a graph?  Do
>>>> you have any data on how much the canonicalization changes when
>>>> a graph is changed?
>>>
>>> Still reading the paper, but given the use of cryptographic
>>> hashing (p. 9), any minor change will result in a cascade effect
>>> that will produce a wholly different output.
>>>
>>> Further, this is precisely what one wants if the
>>> canonicalization algorithm is to be used for cryptographic
>>> applications such as graph signing, as per e.g. the Credentials
>>> CG. Similarity would be a different problem that entails
>>> conflicting requirements.
>>
>> For the graph as a whole, yes, one would certainly want the hash of
>> the graph to change if any part of the graph changes.  But I was
>> talking about the algorithm's re-labeling of the blank nodes within
>> the graph. For graph comparison purposes, it would be very
>> beneficial if the algorithm would tend to generate the same blank
>> node labels even when a graph changes slightly.  Are you saying
>> that there is also value, for cryptographic hashing purposes, in
>> the algorithm *changing* blank node labels if any part of the graph
>> changes?  If so, why?  Can you explain?
>
> If the canonicalized BNode IDs are based on a hash of the dependent
> statements, then the same ID should be generated for a graph having
> more statements that do not relate to those BNodes. I would say that
> the BNode IDs generated from the union of two disjoint graphs should
> be the same as the BNode IDs generated from the union of those
> graphs; this would make it useful for doing an RDF Diff.
>
> Note that the proposed RDF Normalization spec [1] does not generate
> BNode IDs this way, and so would not have this desired purpose,
> although I believe the algorithm could be changed to accommodate
> this.
>
> The RDF Isomorphism algorithm used in the Ruby rdf-isomorphic gem,
> based on Jeremy Carrol’s [2] generates BNode identifiers by creating
> a hash of dependent nodes, and should have this property, for
> disjoint sub-graphs, anyway.

I think blank node stability is a critical property for a 
canonicalization algorithm to have (to the extent feasible), to enable 
graph diffs to be easily computed using standard diff tools.  It would 
be a huge lost opportunity if we standardized a canonicalization 
algorithm without this property.

I'll forward this to the public-json-ld@w3.org list.

Thanks!
David Booth

>
> Gregg
>
> [1] http://json-ld.org/spec/latest/rdf-dataset-normalization/ [2]
> http://www.hpl.hp.com/techreports/2001/HPL-2001-293.pdf
>
>> David Booth
>>
>
>
>
>
>

Received on Thursday, 14 May 2015 19:44:33 UTC