Re: Thoughts on the LDS WG chartering discussion

Hi David,

> On 10 Jun 2021, at 07:27, David Booth <david@dbooth.org> wrote:
> 
> Hi Phil,
> 
> On 6/9/21 6:48 AM, Phil Archer wrote:
>> . . .
>> 1. Why is there any need to sign a graph and not just the bytes? See the explainer document at
>> https://w3c.github.io/lds-wg-charter/explainer.html#noProblem
>> for the answer to this.
> 
> Sorry to belabor this, but I read the explainer document, and I still do not see an answer to this question.
> 
> The section you referenced refers to the "Constrained data transfer" use case and the "Space-efficient verification of the contents of Datasets" use case.  It concludes: "In these scenarios, a signature on the original file, such as a JSON signature on a JSON-LD file, is not appropriate, as the conversion will make it invalid."  I assume "the conversion" means the conversion of the original JSON-LD file to a different RDF serialization, and that sentence is pointing out that the hash of the original JSON-LD file will not match a hash of the different serialization.  But clearly the hash should be taken of a *canonicalized* original, and when it is converted to a different serialization, the recipient must re-canonicalize it before checking the hash.  This is, in essence, exactly what "RDF Dataset Hash (RDH)" in the charter does anyway.
> 
> To my mind, this is analogous to computing the hash of an arbitrary file, compressing the file for transmission (which puts it into a different serialization that is informationally equivalent), and then having the recipient decompress the file before verifying the hash. Serializing RDF to a non-canonical form is analogous to compression: you have to put it back to the canonical form (analogy: decompress it) before checking the hash.
> 
> I agree that to make this work, a canonical RDF *serialization* is needed.  But I do not see the need to canonicalize the *abstract* RDF Dataset (though it is nice to have the canonicalization algorithm defined in a way that allows it to be easily applied to several RDF serializations).  In fact, the proposed "RDF Dataset Hash (RDH)" is actually computed on the canonicalized *serialization* anyway.  It is NOT computed directly on the abstract RDF dataset.  And if the hash is computed on those serialized bytes anyway, then really it is the serialized bytes that are being signed.

You are right that the hash is computed on a specific "canonical" serialization, namely N-Quads (things like the canonical sorting of the quads, the handling of whitespace, etc., must be specified, but that is, comparatively, trivial). It is also true that if one had only a single, specific serialization of a graph/dataset, and that representation (i.e., the text file) was transmitted *verbatim* from sender to receiver, we would not need all this; we could define a canonical version of, say, Turtle, and be done with it.
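To make that concrete, here is a minimal sketch in Python of what hashing a canonical N-Quads document could look like; the sort order and whitespace handling below are placeholders that the actual specification would have to pin down precisely:

import hashlib

def hash_nquads(nquads_lines):
    # Sort the quad lines, join them with newlines, and hash the bytes.
    # Which exact sort order and whitespace rules apply is precisely
    # what the spec must define; this is only an illustration.
    canonical = "\n".join(sorted(nquads_lines)) + "\n"
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()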

But. If I "just" start by, say, a Turtle representation of a Graph, I can of course convert that into canonical n-quads and hash the n-quads. But if the same Turtle representation is converted by RDFLib (or any other tool) into, God forbid, RDF/XML, the BNode identifiers will be different, ie, the conversion of the RDF/XML to n-quads will be different and, consequently, the hash will be different. *Unless the RDF canonicalization assigns the canonical identifiers to the BNodes in the process.*

So I am not sure I actually understand your problem: in the general case, you cannot avoid a canonical relabeling of the BNodes. That is exactly what the abstract RDF canonicalization does: it defines canonical BNode labels in a serialization-independent manner. In my view, that is absolutely necessary in general.
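For illustration only, the core idea can be sketched like this; it is a deliberately naive toy, not the actual algorithm the WG would standardize, and it sidesteps the hard part, namely BNodes that can only be distinguished through other BNodes:

import hashlib

def toy_canonical_labels(quads):
    # quads: list of (s, p, o, g) string tuples; BNodes start with "_:".
    # Each BNode gets a label derived from the quads it occurs in, with
    # the BNode itself replaced by "a" and any other BNode by "z".
    def first_degree_hash(bnode):
        lines = []
        for q in quads:
            if bnode in q:
                lines.append(" ".join(
                    "a" if t == bnode else
                    ("z" if t.startswith("_:") else t)
                    for t in q))
        return hashlib.sha256("\n".join(sorted(lines)).encode()).hexdigest()

    bnodes = {t for q in quads for t in q if t.startswith("_:")}
    return {b: "_:c14n" + first_degree_hash(b)[:8] for b in sorted(bnodes)}

The real algorithm must, in addition, break ties when two BNodes end up with the same first-degree hash, and that is where essentially all of the complexity lives.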

Ivan

(Of course, if we did not have BNodes, then all this discussion would be unnecessary. But we do have them…)

<skip>

> 
> 
> Thanks,
> David Booth
> 
> 


----
Ivan Herman, W3C 
Home: http://www.w3.org/People/Ivan/
mobile: +33 6 52 46 00 43
ORCID ID: https://orcid.org/0000-0003-0782-2704

Received on Thursday, 10 June 2021 07:40:58 UTC