Diff'ing RDF files from Pierre-Antoine Champin on 2024-09-13 (semantic-web@w3.org from September 2024)

From: Pierre-Antoine Champin <pierre-antoine@w3.org>
Date: Fri, 13 Sep 2024 16:18:54 +0200
To: Semantic Web <semantic-web@w3.org>, Felix Sasaki <felix.sasaki@sap.com>, RDF-star WG <public-rdf-star-wg@w3.org>
Message-ID: <9fdff3c7-a8b3-474f-8b30-f4853816a04b@w3.org>

Dear all,

Yesterday, during the RDF-star working group call, I mentioned that RDF 
canonicalization [1] can be used to build a crude RDF "diff" tool, and 
that I was using a small script that I wrote for that purpose. Other 
participants expressed interest in this script, so I cleaned it up a 
bit and published it here:

https://gist.github.com/pchampin/7017fa5ff607e5bedf65e2f9954cfd46


As indicated at the top, it relies on my Sophia library [2] for parsing 
and canonicalizing, but it can be easily adapted to use other 
command-line tools (for a while, I was using Gregg Kellogg's Ruby 
implementation [3]).

Note that I describe it as a *crude* tool because

- if the two graphs/datasets are isomorphic (i.e. identical modulo blank 
node labels), it will show no difference,
- BUT if there is even the slightest difference, the tool may report a 
lot of changes, not all of them relevant.

This is because even a small difference can cause the canonicalization 
to relabel blank nodes in a completely different way. So even blank 
nodes that were not impacted by the change may end up with different 
names, and the text diff applied to the canonical form will then report 
those as changes.
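
To illustrate the relabelling effect, here is a toy sketch in Python. It 
is not the script from the gist and not a conforming implementation of 
[1]: it is only loosely modelled on the idea of hashing each blank node's 
neighbourhood and issuing labels in hash order, and it assumes that every 
blank node gets a distinct hash. The function names and the tuple-based 
triple representation are made up for the example.

```python
import hashlib

def first_degree_hash(triples, node):
    """Hash the triples that mention `node`, ignoring actual blank node labels."""
    lines = []
    for triple in triples:
        if node in triple:
            # The node being hashed is written "_:a", any other blank node "_:z".
            terms = ["_:a" if t == node else "_:z" if t.startswith("_:") else t
                     for t in triple]
            lines.append(" ".join(terms) + " .\n")
    return hashlib.sha256("".join(sorted(lines)).encode("utf-8")).hexdigest()

def canonicalize(triples):
    """Return sorted lines with blank nodes relabelled _:c14n0, _:c14n1, ..."""
    bnodes = {t for triple in triples for t in triple if t.startswith("_:")}
    hashes = {n: first_degree_hash(triples, n) for n in bnodes}
    if len(set(hashes.values())) != len(hashes):
        # Assumption of this toy version: no two blank nodes share a hash.
        raise ValueError("tied hashes: this toy version cannot handle them")
    ordered = sorted(bnodes, key=hashes.get)  # labels are issued in hash order
    labels = {n: f"_:c14n{i}" for i, n in enumerate(ordered)}
    return sorted({" ".join(labels.get(t, t) for t in triple) + " ."
                   for triple in triples})

g1 = [("_:x", "<http://example.org/name>", '"Alice"'),
      ("_:y", "<http://example.org/name>", '"Bob"')]
# Same graph as g1, with different blank node labels and statement order.
g2 = [("_:b1", "<http://example.org/name>", '"Bob"'),
      ("_:b0", "<http://example.org/name>", '"Alice"')]
# g1 plus one extra triple about _:x.
g3 = g1 + [("_:x", "<http://example.org/age>", '"42"')]
```

Here canonicalize(g1) == canonicalize(g2), because the two graphs are 
isomorphic. In g3, the extra triple changes the hash of _:x; depending on 
where the new hash sorts, _:x and _:y may swap labels, in which case a 
text diff against g1 would also flag the untouched "Bob" triple.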

But despite these "false positives", I find it quite useful, and you 
might too. In particular, if the changes only impact triples/quads made 
up solely of IRIs and literals (no blank nodes), the diff will be "exact".
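
The "exact" case can be sketched in a few lines of Python. This is only 
an illustration, not the script from the gist: it assumes the inputs are 
already N-Triples with one statement per line, consistent spacing and no 
blank nodes, so that sorting and de-duplicating the lines stands in for 
canonicalization. The function names are made up for the example.

```python
import difflib

def canonical_lines(ntriples_text):
    """Sorted, de-duplicated statements (valid only for blank-node-free input)."""
    return sorted({line.strip() for line in ntriples_text.splitlines()
                   if line.strip()})

def rdf_diff(old_text, new_text):
    """Unified text diff between the two sorted forms, without context lines."""
    return list(difflib.unified_diff(
        canonical_lines(old_text), canonical_lines(new_text),
        fromfile="old", tofile="new", lineterm="", n=0))

old = """
<http://example.org/alice> <http://xmlns.com/foaf/0.1/name> "Alice" .
<http://example.org/bob> <http://xmlns.com/foaf/0.1/name> "Bob" .
"""
# Statements reordered, and one literal changed.
new = """
<http://example.org/bob> <http://xmlns.com/foaf/0.1/name> "Bob" .
<http://example.org/alice> <http://xmlns.com/foaf/0.1/name> "Alicia" .
"""

for line in rdf_diff(old, new):
    print(line)
```

Only the statement about alice is reported, once as removed and once as 
added; the reordering of the two statements does not show up at all.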

   best

[1] https://github.com/w3c/rdf-canon

[2] https://github.com/pchampin/sophia_rs

[3] https://ruby-rdf.github.io/

Received on Friday, 13 September 2024 14:18:59 UTC