RDF canonicalization [was Re: deterministic naming of blank nodes]

Hi David,


On 14/05/2015 01:20, David Booth wrote:
> Hi Aidan,
>
> On 05/13/2015 02:14 AM, ahogan@dcc.uchile.cl wrote:
>> Aidan Hogan. "Skolemising Blank Nodes while Preserving Isomorphism". In
>> WWW, Florence, Italy, May 18–22, 2015 (to appear).
>> Available from: http://aidanhogan.com/docs/skolems_blank_nodes_www.pdf
>>
>> . . .  the paper presents an algorithm for deterministically
>> labelling blank nodes of any RDF graph such that the labels of two
>> isomorphic RDF graphs will "correspond".
>
> Excellent!  It would be really good if we could define a W3C standard
> N-Triples and N-Quads canonicalization, for a variety of purposes.
>
> But I have a question about your algorithm.  For greatest utility, a
> canonicalization algorithm should be robust against small changes in a
> graph.  In other words, if graphs G and H are very similar then their
> canonicalizations should also be very similar.  (By "very similar" I
> mean that G and H have large subgraphs G' and H' that are isomorphic.)
>
> How robust is your algorithm to small changes in a graph?  Do you have
> any data on how much the canonicalization changes when a graph is changed?

I must admit that I didn't really consider similarity when working on the
algorithm (actually, if anything, I was focused on the opposite ... getting
unique hashes for graphs). In general, the algorithm itself can be
configured to use different hashing functions with different properties,
be they cryptographic or perhaps similarity-based. The algorithm can also
be configured to ignore ground triples or to canonicalise one blank node
partition at a time, etc. So there's room for configuring the algorithm
based on the practical design requirements at hand (and there are a bunch
of open questions in that regard ... or indeed intended applications with
different requirements).
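
To give a flavour of what I mean, here is a toy sketch in Python (my
illustration only, not the algorithm from the paper; it iteratively
hashes each blank node's neighbourhood, and it skips the distinguishing
step the paper needs when two distinct blank nodes end up with the same
hash):

    import hashlib

    def hash_term(term, colours):
        # Blank nodes contribute their current colour; ground terms,
        # themselves.
        return colours.get(term, term)

    def colour_bnodes(graph):
        # `graph` is a set of (s, p, o) string triples; blank nodes are
        # the terms starting with "_:".
        bnodes = {t for (s, p, o) in graph for t in (s, o)
                  if t.startswith("_:")}
        colours = {b: "init" for b in bnodes}
        for _ in range(len(bnodes)):  # enough rounds to propagate
            new = {}
            for b in bnodes:
                parts = sorted("out|%s|%s" % (p, hash_term(o, colours))
                               for (s, p, o) in graph if s == b)
                parts += sorted("in|%s|%s" % (hash_term(s, colours), p)
                                for (s, p, o) in graph if o == b)
                new[b] = hashlib.sha256(
                    "|".join(parts).encode()).hexdigest()
            colours = new
        return colours  # candidate canonical labels for the blank nodes

Swapping hashlib.sha256 for a similarity-preserving hash is exactly the
kind of configuration I have in mind above.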

However, one has to be careful when talking about producing similar
canonicalisations for similar graphs. Consider, for example, producing an
overall hash for two large graphs G and H that differ only in the value of
one IRI in one triple. We would thus consider G and H to be very similar
in the sense you give, and we would want their hashes to be similar too.
But assuming that the set of IRIs is a priori unbounded, there is an
unbounded number of graphs G' that differ from G in that same position by
one IRI. Since the hash function has a fixed-length output, the number of
hashes "similar" to the hash of G is bounded, while the number of graphs
similar to G (even in the limited sense of changing one IRI in one fixed
position) is unbounded. Under these assumptions, one cannot guarantee
unique but similar hashes for all similar graphs.
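
To spell the counting step out (in LaTeX, with n the hash length in bits,
and calling two hashes "similar" when their Hamming distance d_H is at
most d):

    \left|\{\, h' \;:\; d_H(h(G), h') \le d \,\}\right|
      \;=\; \sum_{i=0}^{d} \binom{n}{i} \;\le\; 2^n

The left-hand side is finite, whereas the graphs obtained from G by
swapping that one IRI are in bijection with the unbounded set of IRIs; so
by pigeonhole, either two of those graphs receive the same hash, or some
pair of similar graphs receives dissimilar hashes.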

In other words, there are many more similar graphs than there can be
similar hashes. The only ways out that I can see would be either to adopt
a very low-level definition of similar graphs (based, for example, on
characters in a serialisation rather than terms in a graph), or to use a
hash large enough, and a definition of similar hashes relaxed enough, that
one can still compute similar hashes for similar graphs with a low risk of
hash collision.
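
For the second option, something in the family of similarity-preserving
hashes would be the natural starting point. A toy simhash-style version
over serialised triples (again just a sketch of mine, and it assumes the
blank nodes have already been labelled consistently, e.g. by a
canonicalisation as above) might look like:

    import hashlib

    def simhash(lines, bits=256):
        # `lines` are canonically serialised triples (e.g. N-Triples).
        counts = [0] * bits
        for line in lines:
            h = int(hashlib.sha256(line.encode()).hexdigest(), 16)
            for i in range(bits):
                counts[i] += 1 if (h >> i) & 1 else -1
        return sum(1 << i for i in range(bits) if counts[i] > 0)

    def hamming(a, b):
        # Small Hamming distance <=> graphs share most of their triples.
        return bin(a ^ b).count("1")

Graphs that share most of their triples then receive hashes at small
Hamming distance, which matches the relaxed notion of "similar hashes"
above.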

In general though, I guess it would all depend on why you would need
similarity in the output hashes. I think there are easier ways to detect
graphs that are similar. :)
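
For instance, once the blank nodes are canonically labelled, plain set
overlap of the triples already gives a usable similarity measure; a
minimal sketch (a Jaccard score in Python, purely illustrative):

    def jaccard(g, h):
        # `g` and `h` are sets of canonically labelled triples.
        g, h = set(g), set(h)
        return len(g & h) / len(g | h) if (g or h) else 1.0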

Cheers,
Aidan

Received on Sunday, 17 May 2015 19:55:14 UTC