- From: Sebastian Crane <seabass-labrax@gmx.com>
- Date: Tue, 12 Sep 2023 22:03:49 +0100
- To: public-rch-wg@w3.org
Dear all,

Although I was unable to attend the RDF Canonicalization and Hashing Working Group meeting this week, I did review the minutes, and I am disappointed by the resolution to allow users of the standard to customize the hashing algorithm. I believe this considerably limits the utility of the rdf-canon specification on its own, effectively making it useless for some stakeholders without a wrapping data format.

First of all, I would like to clarify that the change concerns the internal hashing algorithm used by rdf-canon. That algorithm exists merely to disambiguate blank nodes; it does not protect the security of the data. As for the discussions about one hashing algorithm being more secure than another, about algorithms required by certain industry security standards, or about the hypothetical situation in which an algorithm is broken: in our case, none of these corresponds to any risk of loss of security or privacy.

What is lost, however, is the utility of rdf-canon to the Semantic Web and Linked Data ecosystem. The two general use-cases for data canonicalisation are A: identifying duplicate datasets, and B: creating a unique and consistent identifier for a dataset, derived from that dataset itself.

In situation A, users already have access to both copies of the dataset, and can therefore choose whatever internal hashing algorithm they like. The decision matters little, but critically, it can be made without harming interoperability or other factors.

Situation B is what the recent resolution threatens. Because the choice of internal hashing algorithm (again, used for disambiguation, not security) affects the result, there are as many possible identifiers for a dataset as there are hashing algorithms. Not such a unique value now! If you receive a value described simply as a canonicalised hash of a given RDF dataset, and would like to reproduce that value from the dataset, you have no idea which hashing algorithm was used internally: you must run a brute-force check with every hashing algorithm that exists, as the sketch at the end of this message illustrates.

In concrete terms:

- It harms interoperability, as implementations will need to support multiple internal hashing algorithms. Even 'fully conformant' implementations may simply fail to interoperate if they do not implement, for instance, a newer hashing algorithm.
- It indirectly harms security, as these implementations will have a larger attack surface - not a risk specific to hashing as a computation, but simply because larger codebases carry a greater risk of security-critical vulnerabilities.
- It harms performance and energy-efficiency, because all of a dataset's blank nodes (frequently a very large number) must be hashed repeatedly with different algorithms.
- It harms ease of implementation, since some resource-constrained devices simply do not have the capacity to carry tens of hash algorithm implementations. RDF is valuable in embedded and edge computing contexts, but this resolution may jeopardise that use-case.

I hope it is clear that the change harms, to a greater or lesser extent, almost every aspect of rdf-canon, in return for letting us avoid a mostly arbitrary decision about which non-security-critical hashing algorithm to use. There are specifications, such as the 'multibase' family of formats, which would allow users to annotate the hash values, addressing the performance problem and most of the interoperability concern (see the sketch below).
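To make both points concrete, here is a minimal Python sketch. The canonical N-Quads string is a stand-in: in a real rdf-canon implementation the blank-node labels themselves depend on the internal hash, but the point here is only that the final identifier varies with the algorithm choice, so a bare digest forces a verifier into a brute-force search, whereas a self-describing tag does not. The multihash/multibase tagging follows the published tables ('f' = lowercase base16 in multibase; 0x12 = sha2-256 in multihash); names such as dataset_identifier are illustrative, not part of any specification.

    import hashlib

    # Stand-in for the canonical N-Quads serialisation of a dataset.
    # In a real implementation, the blank-node labels themselves would
    # depend on the internal hash, so even this string would differ.
    CANONICAL_NQUADS = (
        '_:c14n0 <http://example.org/p> "o" <http://example.org/g> .\n'
    )

    def dataset_identifier(nquads, hash_name):
        # Hypothetical helper: hash the canonical form with the named algorithm.
        return hashlib.new(hash_name, nquads.encode("utf-8")).hexdigest()

    # Situation B without annotation: the verifier receives a bare digest
    # and must try every plausible internal hash until one matches.
    received = dataset_identifier(CANONICAL_NQUADS, "sha384")
    candidates = ["sha256", "sha384", "sha512", "sha3_256", "sha3_512"]
    matches = [name for name in candidates
               if dataset_identifier(CANONICAL_NQUADS, name) == received]
    print("algorithms tried:", len(candidates), "-> matched:", matches)

    # With a multihash header (0x12 = sha2-256, 0x20 = 32-byte digest)
    # wrapped in multibase ('f' = lowercase base16), the value is
    # self-describing and a single recomputation suffices.
    digest = hashlib.sha256(CANONICAL_NQUADS.encode("utf-8")).digest()
    tagged = "f" + bytes([0x12, len(digest)]).hex() + digest.hex()
    print("self-describing identifier:", tagged)

Note that the tagging in the second half lives entirely in the wrapper, not in rdf-canon itself.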
However, even this partial solution only works outside of rdf-canon; as I alluded to earlier, it means that rdf-canon will become effectively useless for my situation B without a format like multibase to wrap the result. Likewise, use of rdf-canon inside Verifiable Credentials may address some of the issues, thanks to the richer metadata that VCs provide. That metadata would not need to exist, though, if we simply made an informed decision and chose one hash algorithm.

I am more than happy to respond to any queries about the issues I have raised above. I believe that many of them have already been raised by various participants in prior RDF Canonicalization and Hashing Working Group meetings, but were dismissed prematurely in our enthusiasm to reach the Candidate Recommendation stage. What I would ask of the chairs and of my fellow participants in the WG is to consider the industry and community fragmentation that could arise if no particular wrapper format and hash algorithm quickly becomes dominant among stakeholders, and to consider how we can minimise that risk ourselves by making rdf-canon as well specified as possible before entering CR.

Best wishes,

Sebastian
Received on Tuesday, 12 September 2023 21:04:01 UTC