Re: RDF canonicalization [was Re: deterministic naming of blank nodes] from Aidan Hogan on 2015-05-26 (semantic-web@w3.org from May 2015)

From: Aidan Hogan <ahogan@dcc.uchile.cl>
Date: Tue, 26 May 2015 17:35:06 -0300
To: David Booth <david@dbooth.org>, semantic-web@w3.org
Message-ID: <5564D8FA.1040609@dcc.uchile.cl>
Hi David,

Sorry for the delay ... some comments in line.

On 20/05/2015 23:49, David Booth wrote:
> Hi Aidan,
>
> On 05/17/2015 03:54 PM, ahogan@dcc.uchile.cl wrote:
>> On 14/05/2015 01:20, David Booth wrote:
>>> On 05/13/2015 02:14 AM, ahogan@dcc.uchile.cl wrote:
>>>> Aidan Hogan. "Skolemising Blank Nodes while Preserving Isomorphism". In
>>>> WWW, Florence, Italy, May 18–22, 2015 (to appear).
>>>> Available from: http://aidanhogan.com/docs/skolems_blank_nodes_www.pdf
> [ . . . ]
>>> How robust is your algorithm to small changes in a graph?  Do you have
>>> any data on how much the canonicalization changes when a graph is
>>> changed?
>>
>> [ . . . ] one has to be careful when talking about creating similar
>> canonicalisations for similar graphs. Consider for example producing an
>> overall hash for two large graphs G and H that differ only in the
>> value of
>> one IRI in one triple.
>
> I see your point about creating similar hashes for similar graphs, but
> my question was intended to be about the *canonicalizations* of similar
> graphs -- not the hashes of those graphs.  I don't mind if the hashes of
> G and H are very different, but it would be nice if their
> canonicalizations could be as similar as possible as often as possible,
> so that standard diff and patch tools could be used on them.  Could your
> algorithm be adapted that way?
>
> I assume that a canonicalization algorithm will be limited in the degree
> to which it can achieve this goal, and it may be desirable for the
> algorithm to have knowledge of OWL semantics in order to get the best
> results.  For example, if the graph implies that _:b1 owl:sameAs _:b2,
> then it might replace all occurrences of _:b2 with _:b1 before
> canonicalizing.   Or, if a graph contains:
>
>    :alice foaf:knows _:b3 .
>    _:b3 :hasSSN "123-45-6789" .
>    :bob foaf:knows _:b4 .
>    _:b4 :hasSSN "123-45-6789" .
>
>    :hasSSN a owl:InverseFunctionalProperty .
>
> then _:b3 and _:b4 can be merged into a single blank node prior to blank
> node canonicalization.
>
> Furthermore, if a blank node is known to have a key that uniquely
> identifies it, then it would be best if the canonical name of that blank
> node would depend only on those key properties (recursively).  For
> example, given:
>
>    :carol :owns _:b5 .
>    _:b5 :make "Acme" ;
>         :serialNumber "1234" ;
>         :color "red" .
>
> If :make and :serialNumber form a composite key, then the canonical name
> of _:b5 should depend only on those properties -- not on :color.  That
> would enable _:b6 in the following to be canonicalized to the same blank
> node label, even though it specifies a :yearPurchased and does not
> specify a :color :
>
>    :carol :owns _:b6 .
>    _:b6 :model "zinger" ;
>         :serialNumber "1234" ;
>         :yearPurchased "2013" .
>
> OWL2 supports keys via owl:hasKey.
> http://www.w3.org/TR/2009/WD-owl2-new-features-20090611/#F9:_Keys
>
> Could your algorithm be extended to take advantage of OWL assertions?
> For the best canonicalization results, this might mean making use of the
> closure of graph G's imports to aid in canonicalizing graph G.  To do
> this reliably it might be necessary for G to directly contain a crypto
> hash of the closure of its imports, so that any change to the imported
> files would be detectable.

Hmm ... I wouldn't say the algorithm I described could be easily 
extended in this direction no. I think this is quite a different problem 
to the one I was looking at, which was just considering an RDF graph 
structurally without any semantics.

Using keys to Skolemise blank nodes indeed might be a nice idea that's 
indeed practical in many cases and robust to changes. I know that "back 
in the day" a large majority of blank nodes in the wild had, e.g., 
inverse-functional properties attached to them, most of them referring 
to users from social networking sites like Livejournal and so on. I'm 
not sure what the statistics would be like in these days of Linked Data 
though ... I think the trend is towards using IRIs for anything with 
clear identity and using blank nodes just to glue bits of the RDF graph 
together. So while you could find "key information" attached to many 
blank nodes that you could hash for a Skolem IRI, there are also many 
that you could not. Maybe a Skolemisation scheme could look for keys 
first and if not available, use an algorithm like the one I proposed.

There would be a few open questions though, like what to do if a blank 
nodes has, say, multiple keys defined, and so forth (maybe create Skolem 
IRIs for each minimal key, pick a canonical one based on some criteria 
and link to the rest with sameAs).

(Again though, this was different to the focus of my paper.)

Best,
Aidan
Received on Tuesday, 26 May 2015 20:35:37 UTC