Re: Fwd: Re: RDF canonicalization [was Re: deterministic naming of blank nodes] from Dave Longley on 2015-05-26 (public-credentials@w3.org from May 2015)

From: Dave Longley <dlongley@digitalbazaar.com>
Date: Tue, 26 May 2015 10:41:53 -0400
To: David Booth <david@dbooth.org>, public-credentials@w3.org
CC: Aidan Hogan <ahogan@dcc.uchile.cl>
Message-ID: <55648631.8070102@digitalbazaar.com>
On 05/25/2015 10:24 PM, David Booth wrote:
> I am very glad to hear that the W3C Credentials Community Group has 
> taken on the work of specifying RDF graph canonicalization!   It would 
> be really great if a standard canonicalization algorithm could 
> minimize blank node relabeling when a graph is changed slightly, to 
> the extent feasible, since this would make diffs more precise.  Any 
> thoughts on the points below?  How feasible does this sound?

It's feasible but would come with some potential surprises in a number 
of edge cases. With the current algorithm, any change to a statement 
involving a blank node would change the identifier for that blank node 
and any other blank nodes that are connected to it through other blank 
nodes. In addition, any sub-isomorphisms would also be affected.

That being said, we've added a note to the spec to consider using the 
hashes we generate based on information about blank nodes as their 
identifiers instead of relabeling them with a `c14n` prefix. This would 
add a considerable amount of stability to them and therefore better 
support diffing.

Another option that could be considered would be to provide a hook for 
an optional custom callback that would provide "additional data" to be 
included during the blank node hashing process in order to produce a 
more unique and stable result. This could be part of a normalization API 
spec that uses the core algorithm with modifications for such hooks.

>
> David Booth
>
> -------- Forwarded Message --------
> Subject: Re: RDF canonicalization [was Re: deterministic naming of 
> blank nodes]
> Date: Wed, 20 May 2015 22:49:36 -0400
> From: David Booth <david@dbooth.org>
> To: ahogan@dcc.uchile.cl, semantic-web@w3.org
>
> Hi Aidan,
>
> On 05/17/2015 03:54 PM, ahogan@dcc.uchile.cl wrote:
>> On 14/05/2015 01:20, David Booth wrote:
>>> On 05/13/2015 02:14 AM, ahogan@dcc.uchile.cl wrote:
>>>> Aidan Hogan. "Skolemising Blank Nodes while Preserving 
>>>> Isomorphism". In
>>>> WWW, Florence, Italy, May 18–22, 2015 (to appear).
>>>> Available from: http://aidanhogan.com/docs/skolems_blank_nodes_www.pdf
> [ . . . ]
>>> How robust is your algorithm to small changes in a graph?  Do you have
>>> any data on how much the canonicalization changes when a graph is 
>>> changed?
>>
>> [ . . . ] one has to be careful when talking about creating similar
>> canonicalisations for similar graphs. Consider for example producing an
>> overall hash for two large graphs G and H that differ only in the 
>> value of
>> one IRI in one triple.
>
> I see your point about creating similar hashes for similar graphs, but
> my question was intended to be about the *canonicalizations* of similar
> graphs -- not the hashes of those graphs.  I don't mind if the hashes of
> G and H are very different, but it would be nice if their
> canonicalizations could be as similar as possible as often as possible,
> so that standard diff and patch tools could be used on them. Could your
> algorithm be adapted that way?
>
> I assume that a canonicalization algorithm will be limited in the degree
> to which it can achieve this goal, and it may be desirable for the
> algorithm to have knowledge of OWL semantics in order to get the best
> results.  For example, if the graph implies that _:b1 owl:sameAs _:b2,
> then it might replace all occurrences of _:b2 with _:b1 before
> canonicalizing.   Or, if a graph contains:
>
>   :alice foaf:knows _:b3 .
>   _:b3 :hasSSN "123-45-6789" .
>   :bob foaf:knows _:b4 .
>   _:b4 :hasSSN "123-45-6789" .
>
>   :hasSSN a owl:InverseFunctionalProperty .
>
> then _:b3 and _:b4 can be merged into a single blank node prior to blank
> node canonicalization.
>
> Furthermore, if a blank node is known to have a key that uniquely
> identifies it, then it would be best if the canonical name of that blank
> node would depend only on those key properties (recursively).  For
> example, given:
>
>   :carol :owns _:b5 .
>   _:b5 :make "Acme" ;
>        :serialNumber "1234" ;
>        :color "red" .
>
> If :make and :serialNumber form a composite key, then the canonical name
> of _:b5 should depend only on those properties -- not on :color. That
> would enable _:b6 in the following to be canonicalized to the same blank
> node label, even though it specifies a :yearPurchased and does not
> specify a :color :
>
>   :carol :owns _:b6 .
>   _:b6 :model "zinger" ;
>        :serialNumber "1234" ;
>        :yearPurchased "2013" .
>
> OWL2 supports keys via owl:hasKey.
> http://www.w3.org/TR/2009/WD-owl2-new-features-20090611/#F9:_Keys
>
> Could your algorithm be extended to take advantage of OWL assertions?
> For the best canonicalization results, this might mean making use of the
> closure of graph G's imports to aid in canonicalizing graph G.  To do
> this reliably it might be necessary for G to directly contain a crypto
> hash of the closure of its imports, so that any change to the imported
> files would be detectable.
>
> David Booth
>
>
>


-- 
Dave Longley
CTO
Digital Bazaar, Inc.
http://digitalbazaar.com
Received on Tuesday, 26 May 2015 14:42:17 UTC