- From: <ahogan@dcc.uchile.cl>
- Date: Sun, 17 May 2015 16:54:44 -0300
- To: semantic-web@w3.org
- Cc: david@dbooth.org
Hi David,

On 14/05/2015 01:20, David Booth wrote:
> Hi Aiden,
>
> On 05/13/2015 02:14 AM, ahogan@dcc.uchile.cl wrote:
>> Aidan Hogan. "Skolemising Blank Nodes while Preserving Isomorphism".
>> In WWW, Florence, Italy, May 18–22, 2015 (to appear).
>> Available from: http://aidanhogan.com/docs/skolems_blank_nodes_www.pdf
>>
>> . . . the paper presents an algorithm for deterministically
>> labelling blank nodes of any RDF graph such that the labels of two
>> isomorphic RDF graphs will "correspond".
>
> Excellent! It would be really good if we could define a W3C standard
> N-Triples and N-Quads canonicalization, for a variety of purposes.
>
> But I have a question about your algorithm. For greatest utility, a
> canonicalization algorithm should be robust against small changes in a
> graph. In other words, if graphs G and H are very similar then their
> canonicalizations should also be very similar. (By "very similar" I
> mean that G and H have large subgraphs G' and H' that are isomorphic.)
>
> How robust is your algorithm to small changes in a graph? Do you have
> any data on how much the canonicalization changes when a graph is
> changed?

I must admit that I didn't really consider similarity when working on
the algorithm (if anything, I was focused on the opposite ... getting
unique hashes for graphs).

In general, the algorithm itself can be configured to use different
hashing functions with different properties, be they cryptographic or
perhaps similarity-based. The algorithm can also be configured to
ignore ground triples, or to canonicalise one blank-node partition at a
time, etc. So there's room to configure the algorithm based on the
practical design requirements in question (and there's a bunch of open
questions in that regard ... or indeed intended applications with
different requirements).

However, one has to be careful when talking about creating similar
canonicalisations for similar graphs. Consider, for example, producing
an overall hash for two large graphs G and H that differ only in the
value of one IRI in one triple. We would thus consider G and H to be
very similar in the sense you give. Now say we want to produce similar
hashes for these graphs. Assuming that the set of IRIs is a priori
"unbounded", there is an unbounded number of graphs G' that differ from
G in that same position by one IRI. Assuming also that the hash
function produces a fixed-length output, the number of "similar hashes"
to the hash of G is bounded, while the number of graphs similar to G
(even in the limited sense of changing one IRI in a fixed position) is
unbounded. Under these assumptions, one cannot guarantee unique but
similar hashes for all similar graphs: there are far more similar
graphs than there can be similar hashes.

The only way out that I can see would be to have a very low-level
definition of similar graphs – based, for example, on characters in a
serialisation rather than terms in a graph – or to have a hash large
enough, and a definition of similar hashes relaxed enough, that one can
still compute similar hashes for similar graphs with a low risk of hash
collision.

In general though, I guess it all depends on why you would need
similarity in the output hashes. I think there are easier ways to
detect graphs that are similar. :)

Cheers,
Aidan
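
To make the fixed-length-hash point above concrete, here is a minimal
Python sketch; it is not the algorithm from the paper, and the example
triples, prefixes, and the XOR-of-SHA-256 combination are illustrative
assumptions. Changing a single IRI in a single triple yields a digest
that shares no visible structure with the original, which is why a
fixed-length cryptographic hash cannot, by itself, map similar graphs
to similar hashes.

    import hashlib

    def graph_hash(triples):
        # Order-independent hash of a set of ground triples (assumed
        # scheme): hash each triple with SHA-256, then XOR the
        # per-triple digests together into one 256-bit value.
        acc = 0
        for s, p, o in triples:
            digest = hashlib.sha256("{} {} {} .".format(s, p, o).encode("utf-8")).digest()
            acc ^= int.from_bytes(digest, "big")
        return acc.to_bytes(32, "big").hex()

    g = {("ex:alice", "foaf:knows", "ex:bob"),
         ("ex:bob", "foaf:name", '"Bob"')}
    # h differs from g in a single IRI of a single triple ...
    h = {("ex:alice", "foaf:knows", "ex:carol"),
         ("ex:bob", "foaf:name", '"Bob"')}

    # ... yet the two 256-bit digests are entirely different:
    print(graph_hash(g))
    print(graph_hash(h))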
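
On the closing remark that there are easier ways to detect similar
graphs: one such way, sketched here under the assumption that both
graphs are ground (blank nodes would first need canonical labels, e.g.
from the paper's algorithm, before a naive set comparison is
meaningful), is plain Jaccard similarity over the triple sets rather
than comparison of hashes.

    def jaccard(g, h):
        # Jaccard similarity |G intersect H| / |G union H| of two graphs
        # treated as sets of ground triples: 1.0 for identical graphs,
        # degrading gradually as the graphs diverge.
        union = g | h
        return len(g & h) / len(union) if union else 1.0

    g = {("ex:alice", "foaf:knows", "ex:bob"),
         ("ex:alice", "foaf:name", '"Alice"'),
         ("ex:bob", "foaf:name", '"Bob"')}
    h = {("ex:alice", "foaf:knows", "ex:carol"),  # one IRI changed
         ("ex:alice", "foaf:name", '"Alice"'),
         ("ex:bob", "foaf:name", '"Bob"')}

    print(jaccard(g, h))  # 0.5: two shared triples out of four distinct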
Received on Sunday, 17 May 2015 19:55:14 UTC