Re: Diff'ing RDF files from thomas@pellissier-tanon.fr on 2024-09-14 (semantic-web@w3.org from September 2024)

From: <thomas@pellissier-tanon.fr>
Date: Sat, 14 Sep 2024 06:50:14 +0000
To: Pierre-Antoine Champin <pierre-antoine@w3.org>
Cc: Semantic Web <semantic-web@w3.org>, Florian Kleedorfer <florian.kleedorfer@austria.fm>, Felix Sasaki <felix.sasaki@sap.com>, RDF-star WG <public-rdf-star-wg@w3.org>
Message-ID: <CME1k9UtZRVQsT4XJuNWNzFl8kRvM9rY1CDTONWiWdHlga62bEYonvBG0jz1E7JhZWCU08a6vRQpEbk>

> This is due to the fact that even a small difference can cause the canonicalization to relabel blank nodes in a completely different way. So even blank nodes that were not impacted by the change may end up with different names, and so the text diff applied to the canonical form will report those as changes.

One way to work around this issue is to tweak the "Issue Identifier Algorithm" part of RDF canonicalization so that it assigns an identifier based on the node hash instead of a global counter. This way, only blank nodes connected to the changed triples by a path made up solely of blank nodes will get new ids; all the others will keep theirs.
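
To illustrate the idea, here is a rough Python sketch, not the actual RDF canonicalization algorithm. It assumes that every blank node can be told apart by a simplified first-degree hash alone (the N-degree hashing step is left out entirely), and it represents triples as plain tuples of strings. The helper names (first_degree_hash, counter_labels, hash_labels) and the "_:h" label prefix are made up for the example.

```python
import hashlib

# A triple is (subject, predicate, object); blank nodes are strings starting with "_:".
Triple = tuple[str, str, str]


def first_degree_hash(node: str, triples: list[Triple]) -> str:
    """Simplified first-degree hash: hash the sorted triples that mention
    `node`, writing `node` itself as _:a and any other blank node as _:z."""
    def mask(term: str) -> str:
        if not term.startswith("_:"):
            return term
        return "_:a" if term == node else "_:z"

    lines = sorted(
        " ".join(mask(term) for term in triple) + " ."
        for triple in triples
        if node in triple
    )
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


def blank_nodes(triples: list[Triple]) -> set[str]:
    return {term for triple in triples for term in triple if term.startswith("_:")}


def counter_labels(triples: list[Triple]) -> dict[str, str]:
    """Counter-based issuing: a node's label is its rank among all hashes,
    so one changed hash can shift the labels of unrelated nodes."""
    hashes = {n: first_degree_hash(n, triples) for n in blank_nodes(triples)}
    ordered = sorted(hashes, key=lambda n: hashes[n])
    return {n: f"_:c14n{i}" for i, n in enumerate(ordered)}


def hash_labels(triples: list[Triple]) -> dict[str, str]:
    """Hash-based issuing (the tweak suggested above, in simplified form):
    a node's label is derived from its own hash only."""
    return {n: "_:h" + first_degree_hash(n, triples)[:12] for n in blank_nodes(triples)}


before = [
    ("_:x", "<http://example.org/name>", '"Alice"'),
    ("_:y", "<http://example.org/name>", '"Bob"'),
]
# Add one triple about _:y; _:x is not involved in the change.
after = before + [("_:y", "<http://example.org/age>", '"42"')]

old, new = hash_labels(before), hash_labels(after)
print(old["_:x"] == new["_:x"])  # True: _:x keeps its label
print(old["_:y"] == new["_:y"])  # False: only _:y is relabeled
```

With counter_labels, whether _:x keeps its label depends on where the new hash of _:y happens to sort, which is exactly what produces the noisy diffs. One trade-off of the hash-based variant is that truncated hashes could in principle collide, so a real implementation would need to handle that.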

Thomas


On Friday, 13 September 2024 at 20:10, Florian Kleedorfer <florian.kleedorfer@austria.fm> wrote:

> 
> 
> Hi,
> 
> Curious that this is coming up just as we are getting consistent formatting
> for RDF (TTL for now) out the door on behalf of QUDT.
> 
> We looked into canonicalization, but the downside you mention is a
> non-starter if you want to track changes with a version control system,
> so we're just reproducing the input order of blank nodes by hacking into
> the Jena TTL parser.
> 
> code: https://github.com/atextor/turtle-formatter
> 
> which is being plugged into
> 
> https://github.com/diffplug/spotless/ (maven plugin for now)
> 
> Bottom line: you'll be able to format TTL consistently with the spotless
> maven plugin soonish. Maybe one day, you won't even lose your comments.
> 
> Reach out if you want to help make it work for other formats or if you
> want a Gradle/sbt plugin.
> 
> Best regards,
> Florian
> 
> On 2024-09-13 16:18, Pierre-Antoine Champin wrote:
> 
> > Dear all,
> > 
> > yesterday during the RDF-star working group call, I mentioned that RDF
> > canonicalization [1] can be used to build a crude RDF "diff" tool, and
> > that I was using a small script that I wrote for that. Other
> > participants expressed interest in this script, so I cleaned it up a
> > bit and published it here:
> > 
> > https://gist.github.com/pchampin/7017fa5ff607e5bedf65e2f9954cfd46
> > 
> > As indicated at the top, it relies on my Sophia library [2] for parsing
> > and canonicalizing, but it can be easily adapted to use other
> > command-line tools (for a while, I was using Gregg Kellogg's Ruby
> > implementation [3]).
> > 
> > Note that I describe it as a crude tool because
> > 
> > - if the two graphs/datasets are isomorphic (i.e. identical modulo blank
> > node labels), it will show no difference,
> > - BUT if there is only the slightest difference, the tool may report a
> > lot of changes, not all of them relevant.
> > 
> > This is due to the fact that even a small difference can cause the
> > canonicalization to relabel blank nodes in a completely different way.
> > So even blank nodes that were not impacted by the change may end up
> > with different names, and so the text diff applied to the canonical
> > form will report those as changes.
> > 
> > But despite these "false positives", I find it quite useful, and you
> > might too. In particular, if the changes only impact triples/quads on
> > IRIs and literals, the diff will be "exact".
> > 
> > best
> > 
> > [1] https://github.com/w3c/rdf-canon
> > [2] https://github.com/pchampin/sophia_rs
> > [3] https://ruby-rdf.github.io/
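
The canonicalize-then-diff approach described in the quoted message can be sketched as follows. This is not the gist linked above, only a minimal Python illustration of the "exact" case where the graphs contain no blank nodes. It assumes both inputs are N-Triples with consistent whitespace and escaping, so that sorting and de-duplicating the lines stands in for a real canonicalizer. The function names canonical_lines and crude_diff are made up for the example.

```python
import difflib


def canonical_lines(ntriples: str) -> list[str]:
    """Stand-in for canonicalization when there are no blank nodes:
    de-duplicate and sort the non-empty N-Triples lines."""
    return sorted({line.strip() for line in ntriples.splitlines() if line.strip()})


def crude_diff(old: str, new: str) -> list[str]:
    """Text diff of the two canonical forms, keeping only added/removed lines."""
    diff = difflib.unified_diff(
        canonical_lines(old), canonical_lines(new), lineterm="", n=0
    )
    return [
        line for line in diff
        if line[0] in "+-" and not line.startswith(("+++", "---"))
    ]


old_graph = """
<http://example.org/a> <http://example.org/p> "1" .
<http://example.org/b> <http://example.org/p> "2" .
"""
# Same triples in a different order, with one literal changed.
new_graph = """
<http://example.org/b> <http://example.org/p> "3" .
<http://example.org/a> <http://example.org/p> "1" .
"""

for line in crude_diff(old_graph, new_graph):
    print(line)
# -<http://example.org/b> <http://example.org/p> "2" .
# +<http://example.org/b> <http://example.org/p> "3" .
```

The reordering alone produces no output; only the changed literal shows up. Once blank nodes are involved, the sorting step has to be replaced by a proper canonicalization, and that is where the relabeling noise discussed in this thread appears.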

Received on Saturday, 14 September 2024 06:50:26 UTC