Fwd: Re: RDF canonicalization [was Re: deterministic naming of blank nodes]

I am very glad to hear that the W3C Credentials Community Group has 
taken on the work of specifying RDF graph canonicalization!  It would 
be really great if a standard canonicalization algorithm could, to the 
extent feasible, minimize blank node relabeling when a graph is changed 
slightly, since that would make diffs more precise.  Any thoughts on 
the points below?  How feasible does this sound?

David Booth

-------- Forwarded Message --------
Subject: Re: RDF canonicalization [was Re: deterministic naming of blank 
nodes]
Date: Wed, 20 May 2015 22:49:36 -0400
From: David Booth <david@dbooth.org>
To: ahogan@dcc.uchile.cl, semantic-web@w3.org

Hi Aidan,

On 05/17/2015 03:54 PM, ahogan@dcc.uchile.cl wrote:
> On 14/05/2015 01:20, David Booth wrote:
>> On 05/13/2015 02:14 AM, ahogan@dcc.uchile.cl wrote:
>>> Aidan Hogan. "Skolemising Blank Nodes while Preserving Isomorphism". In
>>> WWW, Florence, Italy, May 18–22, 2015 (to appear).
>>> Available from: http://aidanhogan.com/docs/skolems_blank_nodes_www.pdf
[ . . . ]
>> How robust is your algorithm to small changes in a graph?  Do you have
>> any data on how much the canonicalization changes when a graph is changed?
>
> [ . . . ] one has to be careful when talking about creating similar
> canonicalisations for similar graphs. Consider for example producing an
> overall hash for two large graphs G and H that differ only in the value of
> one IRI in one triple.

I see your point about creating similar hashes for similar graphs, but
my question was intended to be about the *canonicalizations* of similar
graphs -- not the hashes of those graphs.  I don't mind if the hashes of
G and H are very different, but it would be nice if their
canonicalizations could be as similar as possible as often as possible,
so that standard diff and patch tools could be used on them.  Could your
algorithm be adapted that way?

I assume that a canonicalization algorithm will be limited in the degree
to which it can achieve this goal, and it may be desirable for the
algorithm to have knowledge of OWL semantics in order to get the best
results.  For example, if the graph implies that _:b1 owl:sameAs _:b2,
then it might replace all occurrences of _:b2 with _:b1 before
canonicalizing.   Or, if a graph contains:

   :alice foaf:knows _:b3 .
   _:b3 :hasSSN "123-45-6789" .
   :bob foaf:knows _:b4 .
   _:b4 :hasSSN "123-45-6789" .

   :hasSSN a owl:InverseFunctionalProperty .

then _:b3 and _:b4 can be merged into a single blank node prior to blank
node canonicalization.
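
To make that concrete, here is a rough Python sketch of such a
pre-merge pass.  It is just my own illustration, not anything from
your paper: triples are plain (subject, predicate, object) string
tuples, and term syntax, entailment, and owl:sameAs between IRIs are
all glossed over.

   # Blank nodes are strings starting with "_:".
   OWL_SAMEAS = "owl:sameAs"

   def is_bnode(term):
       return isinstance(term, str) and term.startswith("_:")

   def rewrite(triples, mapping):
       """Apply a blank node renaming to every triple."""
       return {(mapping.get(s, s), p, mapping.get(o, o))
               for s, p, o in triples}

   def merge_sameas_bnodes(triples):
       """Collapse blank nodes asserted to be owl:sameAs each other
       (a small union-find, so chains and cycles still terminate)."""
       parent = {}

       def find(x):
           parent.setdefault(x, x)
           while parent[x] != x:
               parent[x] = parent[parent[x]]   # path halving
               x = parent[x]
           return x

       for s, p, o in triples:
           if p == OWL_SAMEAS and is_bnode(s) and is_bnode(o):
               parent[find(o)] = find(s)
       return rewrite(triples, {b: find(b) for b in parent})

   def merge_ifp_bnodes(triples, inverse_functional_props):
       """Collapse blank nodes that share a value for an inverse
       functional property such as :hasSSN.  A single pass like this
       does not chase chains of merges, and which node survives
       depends on iteration order; a real algorithm would need to be
       deterministic about both."""
       seen = {}      # (property, value) -> representative blank node
       mapping = {}
       for s, p, o in triples:
           if p in inverse_functional_props and is_bnode(s):
               if (p, o) in seen:
                   mapping[s] = seen[(p, o)]
               else:
                   seen[(p, o)] = s
       return rewrite(triples, mapping)

On the :hasSSN example above, the IFP pass leaves one blank node:

   g = {(":alice", "foaf:knows", "_:b3"),
        ("_:b3", ":hasSSN", '"123-45-6789"'),
        (":bob", "foaf:knows", "_:b4"),
        ("_:b4", ":hasSSN", '"123-45-6789"')}
   print(sorted(merge_ifp_bnodes(g, {":hasSSN"})))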

Furthermore, if a blank node is known to have a key that uniquely
identifies it, then it would be best if the canonical name of that
blank node depended only on those key properties (recursively).  For
example, given:

   :carol :owns _:b5 .
   _:b5 :make "Acme" ;
        :serialNumber "1234" ;
        :color "red" .

If :make and :serialNumber form a composite key, then the canonical name
of _:b5 should depend only on those properties -- not on :color.  That
would enable _:b6 in the following to be canonicalized to the same blank
node label, even though it specifies a :yearPurchased and does not
specify a :color :

   :carol :owns _:b6 .
   _:b6 :make "Acme" ;
        :serialNumber "1234" ;
        :yearPurchased "2013" .

OWL 2 supports keys via owl:hasKey:
http://www.w3.org/TR/2009/WD-owl2-new-features-20090611/#F9:_Keys
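
As a rough illustration (again just my own sketch over plain triple
tuples; the hashing scheme and the hypothetical key_label helper are
assumptions of mine, not anything from your paper), a key-aware label
could be computed from the key properties alone:

   import hashlib

   def key_label(triples, bnode, key_props, labels=None):
       """Derive a label for a blank node from its key property values
       only, so :color or :yearPurchased would not affect it.  Triples
       are (s, p, o) string tuples as in the earlier sketch.  Key
       values that are themselves blank nodes are labelled recursively;
       cycles among key values are not handled properly here."""
       labels = {} if labels is None else labels
       labels[bnode] = "_:pending"   # crude guard against recursion loops
       parts = []
       for s, p, o in sorted(triples):
           if s == bnode and p in key_props:
               if o.startswith("_:"):
                   o = labels.get(o) or key_label(triples, o,
                                                  key_props, labels)
               parts.append((p, o))
       digest = hashlib.sha256(repr(parts).encode("utf-8")).hexdigest()
       labels[bnode] = "_:k" + digest[:12]
       return labels[bnode]

With {:make, :serialNumber} as the key, _:b5 and _:b6 above would then
get the same label despite their differing :color and :yearPurchased
triples:

   g5 = {(":carol", ":owns", "_:b5"),
         ("_:b5", ":make", '"Acme"'),
         ("_:b5", ":serialNumber", '"1234"'),
         ("_:b5", ":color", '"red"')}
   g6 = {(":carol", ":owns", "_:b6"),
         ("_:b6", ":make", '"Acme"'),
         ("_:b6", ":serialNumber", '"1234"'),
         ("_:b6", ":yearPurchased", '"2013"')}
   key = {":make", ":serialNumber"}
   assert key_label(g5, "_:b5", key) == key_label(g6, "_:b6", key)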

Could your algorithm be extended to take advantage of OWL assertions?
For the best canonicalization results, this might mean making use of the
closure of graph G's imports to aid in canonicalizing graph G.  To do
this reliably it might be necessary for G to directly contain a crypto
hash of the closure of its imports, so that any change to the imported
files would be detectable.
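
The hashing step itself seems straightforward.  Here is a naive
sketch, assuming the hard part is already done and each file in the
imports closure is available as canonicalized text (say, sorted
N-Triples):

   import hashlib

   def imports_closure_hash(canonical_documents):
       """One digest over the canonicalized imports closure, so any
       change to an imported file is detectable.  canonical_documents
       holds one already-canonicalized serialization per file in the
       closure of owl:imports."""
       h = hashlib.sha256()
       for doc in sorted(canonical_documents):   # order-independent
           h.update(hashlib.sha256(doc.encode("utf-8")).digest())
       return h.hexdigest()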

David Booth

Received on Tuesday, 26 May 2015 02:24:53 UTC