- From: Phil Archer <phil.archer@gs1.org>
- Date: Thu, 14 Sep 2023 08:23:31 +0000
- To: RDF Dataset Canonicalization and Hash Working Group <public-rch-wg@w3.org>, Pierre-Antoine Champin <pierre-antoine@w3.org>, Markus Sabadello <markus@danubetech.com>
Thanks Sebastian, Ivan and Gregg,

This looks more than trivial, in that we have general agreement that this is a real issue but there isn't immediate and full consensus on what to do about it. Satisfying this issue requires normative changes to the spec. That means we have two options:

1. Ignore the issue and carry on as we are with the spec we have.
2. Address the issue as quickly as we can and restart the CR transition after that.

IMO, option 1 is not really an option and we need to expedite option 2.

I believe it would be difficult to hold a WG meeting next Tuesday 19th. I am not available and, with the RWOT event in Germany, others may be unavailable too - and we all need a break after TPAC. Therefore, I think our next possible meeting would be Wednesday 27th.

I have opened an issue for this in GitHub: https://github.com/w3c/rdf-canon/issues/176. @Sebastian, would you be able to offer a PR please? If you're able to do that by, say, the middle of next week, others will have a chance to look at it, perhaps offer amendments, and, I hope, we can run a meeting on the 27th with an agenda along the lines of:

1. Review PR XXX (will fix Issue 176)
2. Review readiness of the explainer and, if appropriate, resolve to publish
3. Review the spec in the light of these two and, if appropriate, resolve (again) to seek transition to CR

@Markus - do you agree?

Phil

---
Phil Archer
Web Solutions Director, GS1
https://www.gs1.org
https://philarcher.org
+44 (0)7887 767755
https://mastodon.social/@PhilA

-----Original Message-----
From: Gregg Kellogg <gregg@greggkellogg.com>
Sent: Thursday, September 14, 2023 12:39 AM
To: Ivan Herman <ivan@w3.org>
Cc: Sebastian Crane <seabass-labrax@gmx.com>; public-rch-wg@w3.org
Subject: Re: Issues with parameterized hashing algorithms used internally

I'm inclined to agree with Sebastian. It may be easy for people to hear "hash function" and think of threats, but this is only true when hashing the N-Quads result. Hashing within the algorithm is really for disambiguation, and I don't understand the threat model. I see no reason that the hash function used within the algorithm needs to be the same as that used to hash the output.

However, I disagree that the parameterization itself hurts things; it is just that there may not really be a need for it, other than government agencies forbidding any use of certain algorithms.

Gregg Kellogg

Sent from my iPad

On Sep 12, 2023, at 8:59 PM, Ivan Herman <ivan@w3.org> wrote:

Hi Sebastian,

thanks for this: it is indeed an important aspect that we did not discuss. Note that, while writing these lines, I realized that we already have a similar but simpler issue: it is not specified how the hash of the graph is returned, i.e., using what base. That leads to issues similar to the ones you describe, doesn't it?

The reason this approach with parameterized hashing came up, and got voted in, was that it answered a real issue out there, so we should be careful not to dismiss it entirely. We should instead try to solve, additionally, the issue you raise. Actually, you give an answer yourself. What if we say (details to be worked out):

1. We agree that the hash function used to hash the result of the canonicalization (after all, our WG is RCH, not RC) _MUST_ be the same as the one used for the internals of the algorithm.
2. We agree that the result is not only the hash value itself, but also identifies the hash function used to obtain it.

I am not in the best position to decide which approach to use for point 2; I see, for example, that the SRI format [1] essentially encodes the hash algorithm used together with the base64-encoded hash itself.

Would that work?

Ivan

[1] https://w3c.github.io/webappsec-subresource-integrity/#the-integrity-attribute
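(For illustration only: a minimal Python sketch of the kind of self-describing, SRI-style value Ivan's point 2 suggests. The function name and the choice of SHA-256 are assumptions made for this example; nothing here is defined by the rdf-canon spec.)

import base64
import hashlib

def sri_label(canonical_nquads: str, algorithm: str = "sha256") -> str:
    # Hash the canonical N-Quads and return an SRI-style string,
    # "<algorithm>-<base64 digest>", so the value carries both the
    # digest and the algorithm that produced it.
    digest = hashlib.new(algorithm, canonical_nquads.encode("utf-8")).digest()
    return f"{algorithm}-{base64.b64encode(digest).decode('ascii')}"

A consumer can then split the value on the first '-' to recover the algorithm name and verify the digest, with no guesswork about which hash was used.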
On 12 Sep 2023, at 23:03, Sebastian Crane <seabass-labrax@gmx.com> wrote:

Dear all,

Although I was unable to attend the RDF Canonicalization and Hashing Working Group meeting this week, I did review the minutes and am disappointed by the resolution to allow users of the standard to customize the hashing algorithm. I believe this considerably limits the utility of the rdf-canon specification by itself, effectively making it useless without a wrapping data format for some stakeholders.

First of all, I would like to clarify that the change concerns the internal hashing algorithm used in rdf-canon. This exists merely to disambiguate blank nodes, and does not protect data security. The arguments raised in the discussions - that one hashing algorithm is more secure than another, that certain industry security standards require particular algorithms, or that an algorithm might hypothetically be broken - therefore do not correspond to any risk of loss of security or privacy.

What is lost, however, is the utility of rdf-canon to the Semantic Web and Linked Data ecosystem. The two general use-cases for data canonicalisation are A: identifying duplicate datasets, and B: creating a unique and consistent identifier for a dataset, derived from that dataset itself.

In situation A, users already have access to both copies of the dataset, and can therefore choose whatever internal hashing algorithm they like. There is little importance to this decision, but critically, it can be made without harming interoperability or other factors.

Situation B is what the recent resolution threatens. Because the choice of internal hashing algorithm (again, used for disambiguation, not security) affects the result, there are as many possible identifiers of a dataset as there are hashing algorithms. Not such a unique value now! If you receive a value described simply as a canonicalized hash of a given RDF dataset, and would like to reproduce that value from the dataset, you have no idea which hashing algorithm was used internally. You must run a brute-force check with each possible hashing algorithm that exists.
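(To make that brute-force cost concrete, a rough Python sketch. canonicalize_with is a hypothetical stand-in for an rdf-canon implementation with a parameterized internal hash, and the candidate list is illustrative; for simplicity the sketch also assumes the output hash is known to be SHA-256, which a bare value would not even tell you.)

import hashlib

# Hypothetical stand-in for a parameterized rdf-canon implementation;
# an actual library would supply this.
def canonicalize_with(dataset, internal_hash: str) -> str:
    raise NotImplementedError

CANDIDATES = ["sha256", "sha384", "sha512", "sha3-256", "blake2b"]

def find_internal_hash(dataset, claimed_digest: bytes):
    # Trial-and-error recovery of the internal hash behind a bare digest.
    # Every failed guess re-canonicalizes the entire dataset.
    for algorithm in CANDIDATES:
        nquads = canonicalize_with(dataset, internal_hash=algorithm)
        if hashlib.sha256(nquads.encode("utf-8")).digest() == claimed_digest:
            return algorithm
    return None  # exhausted the candidates without a match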
Beyond that, the resolution harms the specification in several ways:

- It harms interoperability, as implementations will need to support multiple internal hashing algorithms. Even 'fully conformant' implementations may simply fail if they do not implement, for instance, a newer hashing algorithm.
- It indirectly harms security, as these implementations will have a larger attack surface - not a risk specifically to do with hashing as a computation, but simply because larger codebases carry a greater risk of security-critical vulnerabilities.
- It harms performance and energy-efficiency, because all of a dataset's blank nodes (a quantity often expressed in orders of magnitude) must be hashed repeatedly with different algorithms.
- It harms ease of implementation, since some resource-constrained devices simply do not have the capacity to hold tens of hash algorithms. RDF is valuable in embedded and edge computing contexts, but this resolution may jeopardise that use-case.

I hope it is clear that the change harms, to a lesser or greater extent, almost every aspect of rdf-canon, in return for letting us avoid a mostly arbitrary decision about which non-security-critical hashing algorithm to use.

There are specifications, such as the 'multibase' family of formats, which would allow users to annotate the hash values, addressing the performance problem and most of the interoperability concern. However, even this partial solution only works outside of rdf-canon; as I alluded to earlier, it means that rdf-canon will become effectively useless for my 'scenario B' without a format like multibase to wrap the result.
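(For illustration: a hand-rolled Python sketch of the kind of annotated value the multiformats specifications enable - multihash names the hash function, multibase names the text encoding. The 0x12 code for sha2-256 and the 'f' prefix for base16 are taken from the published multiformats tables; no multiformats library is used here.)

import hashlib

def self_describing_digest(data: bytes) -> str:
    # Multihash layout: <varint hash code><varint digest length><digest>.
    # Both values fit in a single byte for sha2-256 (code 0x12, length 32).
    digest = hashlib.sha256(data).digest()
    multihash = bytes([0x12, len(digest)]) + digest
    # 'f' is the multibase prefix for lowercase base16 (hex).
    return "f" + multihash.hex()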
Likewise, use of rdf-canon inside Verifiable Credentials may address some of the issues, thanks to the richer metadata that VCs provide. This metadata does not need to exist, though, if we simply make an informed decision and choose one hash algorithm.

I am more than happy to respond to any queries about the issues which I have raised above. I believe that many of them have already been raised by various participants in prior RDF Canonicalization and Hashing Working Group meetings, but have been dismissed prematurely due to our enthusiasm to enter the Candidate Recommendation stage. What I would ask of the chairs and my fellow participants in the WG is to consider the difficulties of industry and community fragmentation that could arise in the event that specific wrapper formats and hash algorithms do not immediately become dominant among stakeholders, and how we can minimise that risk ourselves by making rdf-canon as well-specified as possible before entering CR.

Best wishes,

Sebastian

----
Ivan Herman, W3C
Home: http://www.w3.org/People/Ivan/
mobile: +33 6 52 46 00 43

Received on Thursday, 14 September 2023 08:23:53 UTC