Re: RDF canonicalization [was Re: deterministic naming of blank nodes] from David Booth on 2015-05-21 (semantic-web@w3.org from May 2015)

From: David Booth <david@dbooth.org>
Date: Wed, 20 May 2015 22:49:36 -0400
To: ahogan@dcc.uchile.cl, semantic-web@w3.org
Message-ID: <555D47C0.8070506@dbooth.org>
Hi Aidan,

On 05/17/2015 03:54 PM, ahogan@dcc.uchile.cl wrote:
> On 14/05/2015 01:20, David Booth wrote:
>> On 05/13/2015 02:14 AM, ahogan@dcc.uchile.cl wrote:
>>> Aidan Hogan. "Skolemising Blank Nodes while Preserving Isomorphism". In
>>> WWW, Florence, Italy, May 18–22, 2015 (to appear).
>>> Available from: http://aidanhogan.com/docs/skolems_blank_nodes_www.pdf
[ . . . ]
>> How robust is your algorithm to small changes in a graph?  Do you have
>> any data on how much the canonicalization changes when a graph is changed?
>
> [ . . . ] one has to be careful when talking about creating similar
> canonicalisations for similar graphs. Consider for example producing an
> overall hash for two large graphs G and H that differ only in the value of
> one IRI in one triple.

I see your point about creating similar hashes for similar graphs, but 
my question was intended to be about the *canonicalizations* of similar 
graphs -- not the hashes of those graphs.  I don't mind if the hashes of 
G and H are very different, but it would be nice if their 
canonicalizations could be as similar as possible as often as possible, 
so that standard diff and patch tools could be used on them.  Could your 
algorithm be adapted that way?

I assume that a canonicalization algorithm will be limited in the degree 
to which it can achieve this goal, and it may be desirable for the 
algorithm to have knowledge of OWL semantics in order to get the best 
results.  For example, if the graph implies that _:b1 owl:sameAs _:b2, 
then it might replace all occurrences of _:b2 with _:b1 before 
canonicalizing.   Or, if a graph contains:

   :alice foaf:knows _:b3 .
   _:b3 :hasSSN "123-45-6789" .
   :bob foaf:knows _:b4 .
   _:b4 :hasSSN "123-45-6789" .

   :hasSSN a owl:InverseFunctionalProperty .

then _:b3 and _:b4 can be merged into a single blank node prior to blank 
node canonicalization.

Furthermore, if a blank node is known to have a key that uniquely 
identifies it, then it would be best if the canonical name of that blank 
node would depend only on those key properties (recursively).  For 
example, given:

   :carol :owns _:b5 .
   _:b5 :make "Acme" ;
        :serialNumber "1234" ;
        :color "red" .

If :make and :serialNumber form a composite key, then the canonical name 
of _:b5 should depend only on those properties -- not on :color.  That 
would enable _:b6 in the following to be canonicalized to the same blank 
node label, even though it specifies a :yearPurchased and does not 
specify a :color :

   :carol :owns _:b6 .
   _:b6 :model "zinger" ;
        :serialNumber "1234" ;
        :yearPurchased "2013" .

OWL2 supports keys via owl:hasKey.
http://www.w3.org/TR/2009/WD-owl2-new-features-20090611/#F9:_Keys

Could your algorithm be extended to take advantage of OWL assertions? 
For the best canonicalization results, this might mean making use of the 
closure of graph G's imports to aid in canonicalizing graph G.  To do 
this reliably it might be necessary for G to directly contain a crypto 
hash of the closure of its imports, so that any change to the imported 
files would be detectable.

David Booth
Received on Thursday, 21 May 2015 02:50:05 UTC