Re: Thoughts on the LDS WG chartering discussion from Ivan Herman on 2021-06-10 (semantic-web@w3.org from June 2021)

From: Ivan Herman <ivan@w3.org>
Date: Thu, 10 Jun 2021 17:08:51 +0200
To: David Booth <david@dbooth.org>
Cc: Semantic Web <semantic-web@w3.org>
Message-Id: <52C53DBA-AA31-4847-A3C0-1E6E43D2646E@w3.org>
> On 10 Jun 2021, at 16:13, David Booth <david@dbooth.org> wrote:
> 
> On 6/10/21 3:40 AM, Ivan Herman wrote:
>> . . .
>> If I "just" start with, say, a Turtle representation of a Graph, I can of course convert that into canonical n-quads and hash the n-quads. But if the same Turtle representation is converted by RDFLib (or any other tool) into, God forbid, RDF/XML, the BNode identifiers will be different, i.e., the conversion of the RDF/XML to n-quads will be different and, consequently, the hash will be different. *Unless the RDF canonicalization assigns the canonical identifiers to the BNodes in the process.*
> 
> Yes, of course the hash will be different if you have not first canonicalized back to the canonical N-Quads format before checking the hash.  But that's like saying that if you send a compressed file then the hash of the compressed file won't match the hash of the original file.  Of course it won't: you need to decompress it before checking the hash.
> 
>> So I am not really sure I actually understand your problem: you cannot avoid a canonical relabeling of the BNodes in the general case. That is what the abstract RDF canonicalization does: define canonical BNode labels in a serialization independent manner. In my view, that is absolutely necessary in general.
> 
> I don't think that conclusion logically follows.  Don't get me wrong, I see the value in defining a canonicalization algorithm that can be used on a whole family of RDF serializations, which is what the proposed algorithm does.  But I do not see it as *necessary* to solve the problem.  AFAICT only *one* canonical serialization -- such as canonical N-Quads -- is actually needed to enable any isomorphic RDF serialization to be transmitted, given that we can already convert between various RDF serializations and obtain isomorphic datasets.  (And to whatever extent our current serializations/libraries do not produce isomorphic results then that is a bug that needs to be fixed.)  All the sender and receiver need to do is agree to compute the hash on a canonical N-Quads serialization of the RDF dataset that is transmitted, even if that RDF dataset is transmitted in a completely different serialization.  In fact, if I've understood correctly, that's exactly what the proposed "RDF Dataset Hash (RDH)" algorithm does.  In fact, in the proposed charter, I don't recall seeing the result of the abstract RDF Dataset canonicalization being used for *anything* other than to produce a canonical N-Quads serialization.  That seems to me like pretty compelling evidence that the *abstract* canonicalization is not actually needed: only the canonical N-Quads serialization is really needed.  So I don't understand your view that the *abstract* canonicalization is "absolutely necessary".
> 
> I still feel like I am somehow missing a fundamental assumption that others are making and I have not yet been able to identify.

I wonder whether the misunderstanding is the following: how do you calculate the canonical N-Quads? What will the bnode labels be?

What the canonicalization algorithm does is calculate the canonical bnode labels. I guess you could describe the algorithm as working on a quad representation of the RDF dataset, essentially transforming the quads by relabeling the bnodes with canonical labels. But that is mathematically equivalent to making the same calculation on the abstract RDF data model. In this respect, the n-quads and the abstract model are essentially equivalent…
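
To make the point concrete, here is a minimal sketch in Python. It is not the algorithm proposed in the charter: it is a deliberately simplified relabeling that only works when every bnode can be told apart by the quads that mention it, and it assumes that the token `_:label` never appears inside a literal or an IRI. The function names, the example data, and the `_:c14n` label prefix are illustrative assumptions. The sketch shows that the same dataset, parsed by two tools that pick different bnode labels, hashes differently as plain sorted N-Quads, and that the hashes agree only once the bnodes have been relabeled in a label-independent way.

```python
import hashlib
import re

# Assumption: bnode labels are simple alphanumeric tokens, and "_:" never
# occurs inside a literal or an IRI in the input quads.
BNODE = re.compile(r"_:[A-Za-z0-9]+")


def sha256(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def first_degree_hash(bnode, quads):
    """Hash the quads mentioning `bnode`, independently of the actual labels.

    The bnode itself is written as _:a and every other bnode as _:z.
    """
    related = [q for q in quads if bnode in BNODE.findall(q)]
    masked = sorted(
        BNODE.sub(lambda m: "_:a" if m.group() == bnode else "_:z", q)
        for q in related
    )
    return sha256("\n".join(masked))


def canonical_nquads(quads):
    """Relabel bnodes in hash order, then sort the quads (simplified sketch)."""
    bnodes = {b for q in quads for b in BNODE.findall(q)}
    hashes = {b: first_degree_hash(b, quads) for b in bnodes}
    if len(set(hashes.values())) != len(hashes):
        # The general case needs a more elaborate tie-breaking step.
        raise ValueError("bnodes cannot be distinguished by this sketch")
    ordered = sorted(bnodes, key=hashes.get)
    mapping = {b: "_:c14n%d" % i for i, b in enumerate(ordered)}
    relabelled = sorted(BNODE.sub(lambda m: mapping[m.group()], q) for q in quads)
    return "\n".join(relabelled) + "\n"


# The same dataset as two different (hypothetical) tools might emit it.
from_turtle = [
    '_:b0 <http://example.org/name> "Alice" .',
    '_:b0 <http://example.org/knows> _:b1 .',
    '_:b1 <http://example.org/name> "Bob" .',
]
from_rdfxml = [
    '_:genid1 <http://example.org/name> "Bob" .',
    '_:genid2 <http://example.org/knows> _:genid1 .',
    '_:genid2 <http://example.org/name> "Alice" .',
]

naive_a = sha256("\n".join(sorted(from_turtle)))
naive_b = sha256("\n".join(sorted(from_rdfxml)))
canon_a = sha256(canonical_nquads(from_turtle))
canon_b = sha256(canonical_nquads(from_rdfxml))

print("naive hashes equal:    ", naive_a == naive_b)
print("canonical hashes equal:", canon_a == canon_b)
```

Note that the relabeling step never looks at the serialization the quads came from, only at the structure around each bnode, which is why it can equally be described on the abstract data model.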

Ivan


> 
> Thanks,
> David Booth
> 


----
Ivan Herman, W3C 
Home: http://www.w3.org/People/Ivan/
mobile: +33 6 52 46 00 43
ORCID ID: https://orcid.org/0000-0003-0782-2704
Received on Thursday, 10 June 2021 15:11:28 UTC