Re: 4. Lack of standard RDF canonicalization from Tim Berners-Lee on 2018-11-24 (semantic-web@w3.org from November 2018)

From: Tim Berners-Lee <timbl@w3.org>
Date: Sat, 24 Nov 2018 13:12:22 +0000
To: Ivan Herman <ivan@w3.org>
Cc: David Booth <david@dbooth.org>, SW-forum Web <semantic-web@w3.org>, Dan Brickley <danbri@google.com>, "Sean B. Palmer" <sean@miscoranda.com>, Olaf Hartig <olaf.hartig@liu.se>, "Prof. Axel Polleres" <axel@polleres.net>
Message-Id: <4457EB9C-02D2-4B3C-A959-8A45D3355DE5@w3.org>

> On 2018-11-22, at 17:08, Ivan Herman <ivan@w3.org> wrote:
> 
> Hi David,
> 
>> 
>> 4. Lack of standard RDF canonicalization.  Canonicalization
>> is the ability to represent RDF in a consistent, predictable
>> serialization.  It is essential for diff and digital signatures.
>> Developers expect to be able to diff two files, and source
>> control systems rely on being able to do so.  It is easy with
>> most other data representations.  Why not RDF?  Answer: Blank
>> nodes.  Unrestricted blank nodes cause RDF canonicalization
>> to be a "hard problem", equivalent in complexity to the graph
>> isomorphism problem.[6]
>> 
>> Some recent good progress on canonicalization: JSON-LD
>> https://json-ld.github.io/normalization/spec/ .  However, the
>> current JSON-LD canonicalization draft (called "normalization")
>> is focused only on the digital signatures use case, and
>> needs improvement to better address the diff use case, in
>> which small, localized graph changes should result in small,
>> localized differences in the canonicalized graph.
>> 
> 
> 
> There have been some discussions around this lately. If you are interested, look at:
> 
> https://github.com/w3c/strategy/issues/116
> 
> In particular (specific comments as well as links from those comments):
> 
> https://github.com/w3c/strategy/issues/116#issuecomment-383875628
> https://github.com/w3c/strategy/issues/116#issuecomment-384160630
> https://github.com/w3c/strategy/issues/116#issuecomment-395791130
> https://github.com/w3c/strategy/issues/116#issuecomment-435920927
> 
> http://aidanhogan.com/docs/skolems_blank_nodes_www.pdf
> http://aidanhogan.com/docs/rdf-canonicalisation.pdf
> http://json-ld.github.io/normalization/spec/index.html
> https://github.com/iherman/canonical_rdf
> https://lists.w3.org/Archives/Public/www-archive/2018Oct/0011.html
> 
> It is still not clear how exactly we will move forward, but I have some hopes that this will happen sometime in 2019. It depends on the availability of the people involved; the path to get this done is now relatively clear.
> 
> All that being said: David's point is well taken on blank nodes. If there were no blank nodes around, it would be obvious. Looking at the details of the two available solutions (see points above), it is also true that there may be a middle ground: if the usage of blank nodes were somehow restricted to avoid circular patterns. I *think* (but I am not 100% sure) that if all blank nodes could be expressed by [] in Turtle without any need for explicit bnode identifiers, then both algorithms referred to above would become way simpler.


I think we should just do RDF canonicalization including blank nodes. 
It is not rocket science.
I have a little Python program which does it; I have used it a lot for comparing test results.
An algorithm which works on real data is fine; it does not need to handle an n-dimensional hypercube of bnodes with no other nodes. It generates diffs.

Or maybe we should just stick with the JSON-LD one and make sure it is in all the code bases.
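
To make the "works on real data" idea concrete, here is a minimal sketch. It is not the program mentioned above and not the JSON-LD algorithm. It assumes triples arrive as tuples of N-Triples-style strings with blank nodes written _:name; the function name canonical_lines and the _:cN labels are made up for illustration. It relabels blank nodes by repeatedly hashing each one's neighbourhood, and simply gives up on symmetric cases such as a hypercube of bnodes.

```python
import hashlib


def is_bnode(term):
    # Assumption: blank nodes are written "_:name", as in N-Triples.
    return term.startswith("_:")


def _digest(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def canonical_lines(triples):
    """Return sorted triple lines with blank nodes relabelled deterministically."""
    triples = list(triples)
    bnodes = sorted({t for s, p, o in triples for t in (s, o) if is_bnode(t)})
    # Every blank node starts with the same colour, so input labels play no role.
    colour = {b: _digest("") for b in bnodes}
    # Colour refinement: len(bnodes) rounds are enough for groups to stop splitting.
    for _ in range(len(bnodes)):
        new = {}
        for b in bnodes:
            sig = []
            for s, p, o in triples:
                if s == b:
                    sig.append("out|" + p + "|" + colour.get(o, o))
                if o == b:
                    sig.append("in|" + p + "|" + colour.get(s, s))
            new[b] = _digest(colour[b] + "\n" + "\n".join(sorted(sig)))
        colour = new
    if len(set(colour.values())) != len(bnodes):
        # Two blank nodes look identical from here; a full algorithm would go on.
        raise ValueError("symmetric blank nodes; this sketch cannot order them")
    ordered = sorted(bnodes, key=colour.get)
    label = {b: "_:c%d" % i for i, b in enumerate(ordered)}
    return sorted(
        "%s %s %s ." % (label.get(s, s), p, label.get(o, o)) for s, p, o in triples
    )
```

Two files that differ only in blank node names and triple order give the same list of lines, so the outputs can be handed to an ordinary line-based diff tool.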


Tim


> 
> Ivan
> 
> ----
> Ivan Herman, W3C 
> Publishing@W3C Technical Lead
> Home: http://www.w3.org/People/Ivan/
> mobile: +31-641044153
> ORCID ID: https://orcid.org/0000-0003-0782-2704
>

Received on Saturday, 24 November 2018 13:12:29 UTC