- From: Pierre-Antoine Champin <pierre-antoine@w3.org>
- Date: Thu, 14 Sep 2023 12:56:59 +0200
- To: Sebastian Crane <seabass-labrax@gmx.com>, public-rch-wg@w3.org
- Message-ID: <e6c8571b-7116-1262-fae2-d37935a35528@w3.org>
On 12/09/2023 23:03, Sebastian Crane wrote:
> (...)
>
> What is lost, however, is the utility of rdf-canon to the Semantic Web and
> Linked Data ecosystem. The two general use-cases for data
> canonicalisation are A: identifying duplicate datasets, and B:
> creating a unique and consistent identifier of a dataset, from that
> dataset itself.
>
> (...)
>
> Situation B is what the recent resolution threatens. Because the choice
> of internal hashing algorithm (again, used for disambiguation, not
> security) affects the result, there are as many possible identifiers of
> a dataset as there are hashing algorithms. Not such a unique value now!

I don't see how this is different from the situation we have with hash
values for checking the integrity of files. If you only give me a file
and *some* hash value, it is useless. You also need to tell me which
hash function you used to compute that hash.

For RDF datasets, this is more complex, and was already more complex
before this decision from the group: you needed to specify 1) which c14n
function you had used (URDNA2015, RDFC-1.0) and 2) which hash function
you used on the output. With the proposed change, we have one more
moving part (which is meant as a feature, not a bug), but this does not
qualitatively change the requirement to provide adequate metadata with a
hash value to make it usable.

> If you receive a value described simply as a canonicalized hash of a
> given RDF dataset, and would like to reproduce that value from the
> dataset, you have no idea which hashing algorithm was used
> internally. You must run a brute-force check for each possible hashing
> algorithm that exists.
>
> - It harms interoperability, as implementations will need to support
> multiple internal hashing algorithms. Even 'fully conformant'
> implementations may simply fail to succeed if they do not implement, for
> instance, a newer hashing algorithm.

If the people who can't live with SHA-256 were to reinvent their own
version of rdf-canon from scratch, that would hurt interoperability even
more.

> - It indirectly harms security, as these implementations will have a
> larger attack surface area - not a risk specifically to do with hashing
> as a computation, but simply because larger codebases have a greater
> risk of security-critical vulnerabilities.

My implementation, and I suspect others as well, relies on a library
that provides a whole range of hash functions. This change does not
significantly change the size of the codebase.

> - It harms performance and energy-efficiency, because all the datasets'
> blank nodes (a quantity often expressed in orders of magnitude) must be
> hashed repeatedly with different algorithms.

I don't get that argument. I expect that each application ecosystem will
choose /one/ hash function that works for them.

> - It harms ease of implementation, since some resource-constrained
> devices simply do not have the available capacity to have tens of hash
> algorithms installed.

The spec only requires two.

> RDF is valuable in embedded and edge computing
> contexts, but this resolution may jeopardise this use-case.
>
> I hope it is clear that the change harms, to a lesser or greater extent,
> almost every aspect of rdf-canon, in return for letting us avoid a
> mostly arbitrary decision of what non-security-critical hashing
> algorithm to use.

My feeling is that you are overemphasizing the harms.
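To make the "moving parts" point above concrete, here is a minimal
sketch (Python, standard library only; the N-Quads literal is a
stand-in, not a real RDFC-1.0 output) of why a bare digest is useless
without metadata about how it was produced:

    import hashlib

    # Stand-in for the canonical N-Quads an RDFC-1.0 implementation would
    # emit for some dataset ("_:c14n0" is the label style RDFC-1.0 uses,
    # but this literal is illustrative, not an actual canonicalization result).
    canonical_nquads = '_:c14n0 <http://example.org/p> "o" .\n'

    data = canonical_nquads.encode("utf-8")
    print(hashlib.sha256(data).hexdigest())  # one possible "hash of the dataset"
    print(hashlib.sha384(data).hexdigest())  # a different value, same dataset

    # The internal hash is a further moving part: RDFC-1.0 run with SHA-256
    # vs SHA-384 can assign different canonical labels, so even the N-Quads
    # above would differ. Hence the three pieces of metadata: the c14n
    # algorithm, its internal hash, and the final hash.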
That being said, +1 to trying to prevent those harms, by (for example):

- Add some text about the need to provide sufficient metadata with the
  final hash to make it usable (e.g. by using multibase).
- Coin standard IRIs for identifying the standard "processing chains"
  from RDF dataset to hash value (i.e. "RDFC 1.0 using SHA-256, then
  hashed with SHA-256", "RDFC 1.0 using SHA-384, then hashed with
  SHA-384"), which external methods for conveying the metadata (e.g.
  VCs) could use.
- Make SHA-256 the "official default" function, and add some guidance
  about interoperability ("do not use another hash function unless
  strongly required to").

My 2¢

  pa

> There are specifications such as the 'multibase' family of formats
> which would allow users to annotate the hash values, addressing the
> performance problem and most of the interoperability concern. However,
> even this partial solution only works outside of rdf-canon; as I alluded
> to earlier, it means that rdf-canon will become effectively useless for
> my 'scenario B' without a format like multibase to wrap the
> result. Likewise, use of rdf-canon inside Verifiable Credentials may
> address some of the issues due to the richer metadata that VCs
> provide. This is metadata that does not need to exist, though, if we
> simply make an informed decision and choose one hash algorithm.
>
> I am more than happy to respond to any queries about the issues which I
> have raised above. I believe that many of them have already been raised
> by various participants in prior RDF Canonicalization and Hashing
> Working Group meetings, but have been dismissed prematurely due to our
> enthusiasm to enter our Candidate Recommendation stage.
>
> What I would ask of the chairs and my fellow participants in the WG is
> to consider the difficulties of industry and community fragmentation
> that could potentially arise in the event that specific wrapper formats
> and hash algorithms do not immediately become dominant among
> stakeholders, and how we can minimise that risk ourselves by making
> rdf-canon as well-specified as possible before entering CR.
>
> Best wishes,
>
> Sebastian
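PS: a small sketch of the multibase suggestion above: wrapping the final
digest in a multihash, then adding a multibase prefix, makes the value
itself say which (final) hash function was used. The 0x12 code for
sha2-256 and the "f" prefix for base16 are taken from the multiformats
tables as I understand them, so treat the details as illustrative rather
than normative:

    import hashlib

    def multibase_multihash_sha256(canonical_nquads: str) -> str:
        digest = hashlib.sha256(canonical_nquads.encode("utf-8")).digest()
        multihash = bytes([0x12, len(digest)]) + digest  # <hash code><length><digest>
        return "f" + multihash.hex()                     # "f" = multibase base16 prefix

    print(multibase_multihash_sha256('_:c14n0 <http://example.org/p> "o" .\n'))

    # This only identifies the *final* hash function; which c14n algorithm
    # and which internal hash produced the canonical N-Quads still has to be
    # conveyed separately, e.g. via the "processing chain" IRIs suggested above.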
Attachments
- application/pgp-keys attachment: OpenPGP public key
Received on Thursday, 14 September 2023 10:57:04 UTC