Re: Thoughts on the LDS WG chartering discussion from David Booth on 2021-06-10 (semantic-web@w3.org from June 2021)

From: David Booth <david@dbooth.org>
Date: Thu, 10 Jun 2021 10:13:12 -0400
To: semantic-web@w3.org
Message-ID: <e700d749-0653-be1e-355f-9cf35673635c@dbooth.org>

On 6/10/21 3:40 AM, Ivan Herman wrote:
> . . .
> If I "just" start by, say, a Turtle representation of a Graph, I 
> can of course convert that into canonical n-quads and hash the n-quads. 
> But if the same Turtle representation is converted by RDFLib (or any 
> other tool) into, God forbid, RDF/XML, the BNode identifiers will be 
> different, ie, the conversion of the RDF/XML to n-quads will be 
> different and, consequently, the hash will be different. *Unless the RDF 
> canonicalization assigns the canonical identifiers to the BNodes in the 
> process.*

Yes, of course the hash will be different if you have not first 
canonicalized back to the canonical N-Quads format before checking the 
hash.  But that's like saying that if you send a compressed file then 
the hash of the compressed file won't match the hash of the original 
file.  Of course it won't: you need to decompress it before checking the 
hash.
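
To make the compression analogy concrete, here is a minimal sketch in 
Python.  The sample N-Triples line and the helper name sha256_hex are 
made up for illustration, and zlib merely stands in for "some other 
encoding of the same content":

```python
import hashlib
import zlib

# Hypothetical sample content; any bytes would do.
original = b'<http://example.org/s> <http://example.org/p> "o" .\n'

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()

# The same content in a different encoding (here: compressed).
compressed = zlib.compress(original)

hash_original = sha256_hex(original)
hash_compressed = sha256_hex(compressed)                    # differs
hash_roundtrip = sha256_hex(zlib.decompress(compressed))    # matches

print(hash_original == hash_compressed)   # False
print(hash_original == hash_roundtrip)    # True
```

The hash only matches once the receiver has converted back to the form 
on which the hash was originally computed.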

> So I am not really sure I actually understand your problem: you cannot 
> avoid a canonical relabeling of the BNodes in the general case. That is 
> what the abstract RDF canonicalization does: define canonical BNode 
> labels in a serialization independent manner. In my view, that is 
> absolutely necessary in general.

I don't think that conclusion logically follows.  Don't get me wrong, I 
see the value in defining a canonicalization algorithm that can be used 
on a whole family of RDF serializations, which is what the proposed 
algorithm does.  But I do not see it as *necessary* to solve the 
problem.  AFAICT only *one* canonical serialization -- such as canonical 
N-Quads -- is actually needed to enable any isomorphic RDF serialization 
to be transmitted, given that we can already convert between various RDF 
serializations and obtain isomorphic datasets.  (And to whatever extent 
our current serializations/libraries do not produce isomorphic results 
then that is a bug that needs to be fixed.)  All the sender and receiver 
need to do is agree to compute the hash on a canonical N-Quads 
serialization of the RDF dataset that is transmitted, even if that RDF 
dataset is transmitted in a completely different serialization.  In 
fact, if I've understood correctly, that's exactly what the proposed 
"RDF Dataset Hash (RDH)" algorithm does.  Moreover, in the proposed 
charter, I don't recall seeing the result of the abstract RDF Dataset 
canonicalization being used for *anything* other than to produce a 
canonical N-Quads serialization.  That seems to me like pretty 
compelling evidence that the *abstract* canonicalization is not actually 
needed: only the canonical N-Quads serialization is really needed.  So I 
don't understand your view that the *abstract* canonicalization is 
"absolutely necessary".
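
As a rough sketch of the sender/receiver agreement described above, 
consider the following Python.  It is deliberately a toy: the function 
names canonical_nquads and dataset_hash are hypothetical, the quads are 
assumed to be already parsed into N-Quads term strings, and the sample 
dataset contains NO blank nodes, so sorting the lines is enough to get a 
stable serialization.  A dataset with blank nodes would need a real 
canonicalization step to choose the blank node labels; that step is not 
shown here.

```python
import hashlib

def canonical_nquads(quads):
    """Toy stand-in for a canonical N-Quads serializer.

    Assumption: no blank nodes, so de-duplicating and sorting the
    statement lines yields the same text for the same dataset.
    """
    lines = sorted({" ".join(quad) + " .\n" for quad in quads})
    return "".join(lines)

def dataset_hash(quads):
    """Hash the (toy) canonical N-Quads text, not the wire format."""
    text = canonical_nquads(quads)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Sender's dataset, e.g. as parsed from Turtle (hypothetical data).
sender_quads = [
    ("<http://example.org/a>", "<http://example.org/p>", '"1"'),
    ("<http://example.org/b>", "<http://example.org/p>", '"2"',
     "<http://example.org/g>"),
]

# Receiver's dataset, e.g. as parsed from some other serialization:
# same quads, different order, one statement repeated.
receiver_quads = [
    ("<http://example.org/b>", "<http://example.org/p>", '"2"',
     "<http://example.org/g>"),
    ("<http://example.org/a>", "<http://example.org/p>", '"1"'),
    ("<http://example.org/a>", "<http://example.org/p>", '"1"'),
]

print(dataset_hash(sender_quads) == dataset_hash(receiver_quads))  # True
```

The point of the sketch is only that both sides hash the same agreed 
serialization, regardless of the serialization used on the wire.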

I still feel like I am somehow missing a fundamental assumption that 
others are making and that I have not yet been able to identify.

Thanks,
David Booth
Received on Thursday, 10 June 2021 14:15:27 UTC