
Re: RDF Dataset Canonicalization - Formal Proof

From: Manu Sporny <msporny@digitalbazaar.com>
Date: Sat, 27 Mar 2021 14:49:20 -0400
To: W3C Credentials CG <public-credentials@w3.org>
Message-ID: <77510e57-29df-49d1-c442-feaa373fb8c7@digitalbazaar.com>
On 3/27/21 1:35 PM, Alan Karp wrote:
> It is possible to produce a digest without canonicalization by using a set 
> hash, which I see you mention in your paper.  I haven't read the whole
> thing yet, so I don't know why you need canonicalization. See
> https://www.hpl.hp.com/techreports/2003/HPL-2003-235R1.pdf

Hey Alan, I didn't know that you had worked in the graph digest space
before... neat! :)

I just read your paper front to back several times. I'm fairly certain I
understand it:

Fundamentally, the paper uses a cryptographic combination function that is both
associative and commutative, applied over every statement in a graph, to
generate a final hash value in O(n) (with some additional protections against
adversarial brute forcing).
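A minimal sketch of that idea, assuming per-statement SHA-256 digests combined
with addition modulo a large prime (one commutative, associative combination in
the spirit of the set-hash construction; the function names and the choice of
modulus here are illustrative, not the paper's exact scheme):

```python
import hashlib

# Illustrative: combine per-statement hashes with an operation that is
# commutative and associative, so the final digest does not depend on
# statement order -- O(n), with no canonical ordering of the graph needed.
P = 2**256 - 189  # a 256-bit prime modulus (illustrative choice)

def statement_hash(statement: str) -> int:
    return int.from_bytes(hashlib.sha256(statement.encode()).digest(), "big")

def set_hash(statements) -> int:
    acc = 0
    for s in statements:
        acc = (acc + statement_hash(s)) % P  # order-independent accumulation
    return acc

g1 = ["<a> <p> <b> .", "<b> <p> <c> ."]
g2 = list(reversed(g1))
assert set_hash(g1) == set_hash(g2)  # same digest regardless of order
```

(The real construction adds protections against an adversary crafting
statement sets that collide under the combination function.)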

There are two things the paper also depends on that make the solution
non-generalized: 1) you presume you know how the sender is going to label the
blank nodes, or 2) you annotate the graph with extra information about the
blank nodes. The first is an assumption you cannot depend on in a global
system; the second modifies the graph in ways that require an annotation
language and a way of encapsulating the annotations in the message to the
receiver (unacceptable complexity for most of the use cases we're dealing with).
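To see why the blank-node labeling assumption matters, here is a hedged sketch
(reusing the same illustrative set-hash from above) of two senders serializing
the *same* graph with different blank node labels -- the statement strings
differ, so the digests diverge even though the graphs are isomorphic:

```python
import hashlib

P = 2**256 - 189  # illustrative 256-bit prime modulus

def set_hash(statements) -> int:
    acc = 0
    for s in statements:
        acc = (acc + int.from_bytes(
            hashlib.sha256(s.encode()).digest(), "big")) % P
    return acc

# Isomorphic graphs; the two senders just picked different blank node labels.
sender_a = ["_:b0 <knows> _:b1 .", "_:b1 <knows> _:b0 ."]
sender_b = ["_:x <knows> _:y .", "_:y <knows> _:x ."]

assert set_hash(sender_a) != set_hash(sender_b)  # digests do not match
```

This is exactly the gap canonicalization closes: deterministically relabeling
the blank nodes before hashing, instead of assuming the labels or shipping
extra annotations.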

In addition, the solution only contemplates RDF Graphs and not RDF Datasets
(which didn't exist at the time the paper was written).

So, while it is true that you can produce *a* digest w/o doing
canonicalization using set hashes, the solution presumes things that we can't
presume in real world scenarios. For example, a non-trivial subset of
developers are not going to understand why they have to "always generate a
unique cryptographic ID for every one of their JSON objects". Or why the data
coming into their system has a whole bunch of annotations in their JSON
objects that look like garbage to them, which they then have to turn around
and store in their databases.

We had considered all of these things before writing the RDF Dataset
Canonicalization paper.

-- manu

-- 
Manu Sporny - https://www.linkedin.com/in/manusporny/
Founder/CEO - Digital Bazaar, Inc.
blog: Veres One Decentralized Identifier Blockchain Launches
https://tinyurl.com/veres-one-launches
Received on Saturday, 27 March 2021 18:49:37 UTC
