4. Lack of standard RDF canonicalization from Ivan Herman on 2018-11-22 (semantic-web@w3.org from November 2018)

From: Ivan Herman <ivan@w3.org>
Date: Thu, 22 Nov 2018 18:08:32 +0100
To: David Booth <david@dbooth.org>
Cc: semantic-web <semantic-web@w3.org>, Dan Brickley <danbri@google.com>, "Sean B. Palmer" <sean@miscoranda.com>, Olaf Hartig <olaf.hartig@liu.se>, "Prof. Axel Polleres" <axel@polleres.net>
Message-Id: <936A9A07-7F03-420B-A248-0AE8139A2169@w3.org>

Hi David,

> 
> 4. Lack of standard RDF canonicalization.  Canonicalization
> is the ability to represent RDF in a consistent, predictable
> serialization.  It is essential for diff and digital signatures.
> Developers expect to be able to diff two files, and source
> control systems rely on being able to do so.  It is easy with
> most other data representations.  Why not RDF?  Answer: Blank
> nodes.  Unrestricted blank nodes cause RDF canonicalization
> to be a "hard problem", equivalent in complexity to the graph
> isomorphism problem.[6]
> 
> Some recent good progress on canonicalization: JSON-LD
> https://json-ld.github.io/normalization/spec/ .  However, the
> current JSON-LD canonicalization draft (called "normalization")
> is focused only on the digital signatures use case, and
> needs improvement to better address the diff use case, in
> which small, localized graph changes should result in small,
> localized differences in the canonicalized graph.
> 


There have been some discussions around this lately. If you are interested, look at:

https://github.com/w3c/strategy/issues/116

In particular (specific comments as well as links from those comments):

https://github.com/w3c/strategy/issues/116#issuecomment-383875628
https://github.com/w3c/strategy/issues/116#issuecomment-384160630
https://github.com/w3c/strategy/issues/116#issuecomment-395791130
https://github.com/w3c/strategy/issues/116#issuecomment-435920927

http://aidanhogan.com/docs/skolems_blank_nodes_www.pdf
http://aidanhogan.com/docs/rdf-canonicalisation.pdf
http://json-ld.github.io/normalization/spec/index.html
https://github.com/iherman/canonical_rdf
https://lists.w3.org/Archives/Public/www-archive/2018Oct/0011.html

It is still not clear how exactly we will move forward, but I have some hope that this will happen sometime in 2019. It depends on the availability of the people involved; the path to get this done is now relatively clear.

All that being said: David's point is well taken on blank nodes. If there were no blank nodes around, canonicalization would be straightforward. Looking at the details of the two available solutions (see the links above), it is also true that there may be a middle ground: restricting the usage of blank nodes so that circular patterns are avoided. I *think* (but I am not 100% sure) that if all blank nodes could be expressed by [] in Turtle, without any need for explicit bnode identifiers, then both algorithms referred to above would become way simpler.
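
To make the "middle ground" a bit more concrete, here is a minimal Python sketch. It is not either of the two algorithms linked above. It only assumes that triples arrive as plain (subject, predicate, object) string tuples in N-Triples-like syntax, that blank nodes start with "_:", that they never form a circular pattern, and that no two blank nodes have exactly the same description. The function name canonicalize and the label format are made up for illustration.

```python
import hashlib


def is_bnode(term):
    """Blank nodes are assumed to be written as '_:label' (an assumption of this sketch)."""
    return term.startswith("_:")


def canonicalize(triples):
    """Return sorted N-Triples-like lines with content-derived blank node labels.

    Works only when blank nodes do not form cycles; raises ValueError otherwise.
    """
    # Collect the outgoing (predicate, object) pairs of every blank node.
    outgoing = {}
    for s, p, o in triples:
        if is_bnode(s):
            outgoing.setdefault(s, []).append((p, o))
        if is_bnode(o):
            outgoing.setdefault(o, [])

    labels = {}

    def label(node, path):
        # Hash a blank node from its own description, recursing into nested ones.
        if node in labels:
            return labels[node]
        if node in path:
            raise ValueError("circular blank node pattern: outside this sketch")
        parts = sorted(
            p + " " + (label(o, path | {node}) if is_bnode(o) else o)
            for p, o in outgoing[node]
        )
        digest = hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()
        labels[node] = "_:b" + digest[:16]
        return labels[node]

    for node in outgoing:
        label(node, frozenset())

    # Two blank nodes with identical descriptions would be merged; refuse instead.
    if len(set(labels.values())) != len(labels):
        raise ValueError("blank nodes with identical descriptions: outside this sketch")

    def rename(term):
        return labels.get(term, term)

    # Without unstable labels, sorting the lines gives a stable serialization.
    return sorted(f"{rename(s)} {p} {rename(o)} ." for s, p, o in triples)


if __name__ == "__main__":
    graph = [
        ("<http://example.org/a>", "<http://example.org/p>", "_:x"),
        ("_:x", "<http://example.org/q>", '"1"'),
        ("_:x", "<http://example.org/r>", "_:y"),
        ("_:y", "<http://example.org/q>", '"2"'),
    ]
    print("\n".join(canonicalize(graph)))
```

Because each label depends only on what the blank node says, renaming the blank nodes in the input or reordering the triples gives the same output, which is what diff needs. The general case, with cycles and indistinguishable blank nodes, is exactly where the two algorithms above have to do the hard work.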

Ivan

----
Ivan Herman, W3C 
Publishing@W3C Technical Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
ORCID ID: https://orcid.org/0000-0003-0782-2704

Received on Thursday, 22 November 2018 17:08:40 UTC