Re: Thoughts on the LDS WG chartering discussion

Hi Phil,

On 6/9/21 6:48 AM, Phil Archer wrote:
> . . .
> 1. Why is there any need to sign a graph and 
> not just the bytes? See the explainer document at
> https://w3c.github.io/lds-wg-charter/explainer.html#noProblem
> for the answer to this.

Sorry to belabor this, but I read the explainer document, and I still do 
not see an answer to this question.

The section you referenced refers to the "Constrained data transfer" use 
case and the "Space-efficient verification of the contents of Datasets" 
use case.  It concludes: "In these scenarios, a signature on the 
original file, such as a JSON signature on a JSON-LD file, is not 
appropriate, as the conversion will make it invalid."  I assume "the 
conversion" means the conversion of the original JSON-LD file to a 
different RDF serialization, and that sentence is pointing out that the 
hash of the original JSON-LD file will not match a hash of the different 
serialization.  But clearly the hash should be taken of a 
*canonicalized* original, and when it is converted to a different 
serialization, the recipient must re-canonicalize it before checking the 
hash.   This is, in essence, exactly what "RDF Dataset Hash (RDH)" in 
the charter does anyway.
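
To make the point concrete, here is a toy sketch in Python.  The 
"canonicalization" here just sorts N-Triples-style lines -- the real RDF 
Dataset Canonicalization problem is much harder (chiefly deterministic 
blank node labeling) -- but the principle is the same: hash a canonical 
form, not the bytes as received.

```python
import hashlib

# Two serializations of the same two triples -- same abstract content,
# different statement order (a stand-in for, e.g., JSON-LD vs. Turtle).
original = """<urn:a> <urn:p> <urn:b> .
<urn:a> <urn:q> <urn:c> .
"""
converted = """<urn:a> <urn:q> <urn:c> .
<urn:a> <urn:p> <urn:b> .
"""

def toy_canonicalize(ntriples: str) -> bytes:
    # Toy canonicalization: sort the statements and normalize whitespace.
    lines = sorted(line.strip() for line in ntriples.splitlines() if line.strip())
    return ("\n".join(lines) + "\n").encode("utf-8")

def canonical_hash(ntriples: str) -> str:
    return hashlib.sha256(toy_canonicalize(ntriples)).hexdigest()

# Hashes of the raw bytes differ across serializations...
assert hashlib.sha256(original.encode()).hexdigest() != \
       hashlib.sha256(converted.encode()).hexdigest()
# ...but after the recipient re-canonicalizes, the hashes match.
assert canonical_hash(original) == canonical_hash(converted)
```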

To my mind, this is analogous to computing the hash of an arbitrary 
file, compressing the file for transmission (which puts it into a 
different serialization that is informationally equivalent), and then 
having the recipient decompress the file before verifying the hash. 
Serializing RDF to a non-canonical form is analogous to compression: you 
have to put it back to the canonical form (analogy: decompress it) 
before checking the hash.
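
In code, the compression analogy looks like this (plain Python stdlib; 
the compressed bytes stand in for a non-canonical serialization):

```python
import hashlib
import zlib

payload = b"<urn:a> <urn:p> <urn:b> .\n"

# Sender: hash the canonical form, then compress for transmission.
digest = hashlib.sha256(payload).hexdigest()
wire_bytes = zlib.compress(payload)

# The compressed bytes are informationally equivalent but hash differently.
assert hashlib.sha256(wire_bytes).hexdigest() != digest

# Recipient: decompress (i.e., restore the canonical form) before verifying.
received = zlib.decompress(wire_bytes)
assert hashlib.sha256(received).hexdigest() == digest
```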

I agree that to make this work, a canonical RDF *serialization* is 
needed.  But I do not see the need to canonicalize the *abstract* RDF 
Dataset (though it is nice to have the canonicalization algorithm 
defined in a way that allows it to be easily applied to several RDF 
serializations).  In fact, the proposed "RDF Dataset Hash (RDH)" is 
actually computed on the canonicalized *serialization* anyway.  It is 
NOT computed directly on the abstract RDF dataset.  And if the hash is 
computed on those serialized bytes, then really it is the serialized 
bytes that are being signed.

It is a leap of faith to believe that the signing of those RDF bytes 
indicates that the signer agrees with the semantic content that those 
bytes represent.  But that is the same leap of faith that we take when 
the bytes represent a PDF document that was digitally signed.  The leap 
of faith is that we can interpret those bytes as they were semantically 
intended -- either as RDF, PDF or whatever.

So unfortunately I seem to be missing a fairly fundamental point here, 
because I still do not understand what benefit is gained by restricting 
the source documents to RDF.  Why not also allow other kinds of 
documents, such as PDF?

Or, to recast my question in terms of Manu's summary:

On 6/6/21 4:52 PM, Manu Sporny wrote:
 > 1. Define a generalized canonicalization mechanism for
 >     abstract RDF Datasets.
 >
 > 2. Define a way of serializing and hashing the
 >     canonicalized form from #1.
 >
 > 3. Define a way of expressing digital signatures (proofs)
 >     using the hashed form of the RDF Dataset from #2.
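
Recasting those three steps as a minimal sketch (my own toy model: 
tuples for an "abstract" dataset, sorting for step 1, and an HMAC 
standing in for a real public-key signature in step 3):

```python
import hashlib
import hmac

def canonicalize(dataset: set) -> list:
    # Step 1 (stand-in): impose a deterministic order on an "abstract"
    # dataset, modeled here as a set of (subject, predicate, object) tuples.
    return sorted(dataset)

def serialize_and_hash(canonical: list) -> str:
    # Step 2: the hash is necessarily taken over serialized *bytes* --
    # which is the point made above.
    serialized = "\n".join(" ".join(t) for t in canonical).encode("utf-8")
    return hashlib.sha256(serialized).hexdigest()

def sign(digest: str, key: bytes) -> str:
    # Step 3 (stand-in): an HMAC in place of a real signature scheme,
    # just to show where the hash from step 2 plugs in.
    return hmac.new(key, digest.encode(), hashlib.sha256).hexdigest()

dataset = {("<urn:a>", "<urn:p>", "<urn:b>"), ("<urn:a>", "<urn:q>", "<urn:c>")}
digest = serialize_and_hash(canonicalize(dataset))
proof = sign(digest, key=b"shared-secret")
# Note that nothing in steps 2-3 depends on the hashed bytes being RDF:
# any document's hash (a PDF's, say) could be signed the same way.
```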

Why do the digital signatures need to be restricted to using the hash of 
canonicalized RDF, as opposed to using the hash of, say, a PDF document? 
Wouldn't people want to digitally sign PDF documents too?  Why 
shouldn't they use the same RDF digital signature vocabulary to talk 
about PDF documents instead of RDF Datasets?

I feel like I'm missing some fundamental assumption in your intended use 
case.

Thanks,
David Booth

Received on Thursday, 10 June 2021 05:28:13 UTC