Re: Issues with parameterized hashing algorithms used internally

I’m inclined to agree with Sebastian. It may be easy for people to hear “hash function” and think of threats, but that concern only applies when hashing the N-Quads result. Hashing within the algorithm is really for disambiguating, and I don't understand the threat model there. I see no reason that the hash function used within the algorithm needs to be the same as that used to hash the output.

However, I disagree that the parameterization itself hurts things; it is just that there may not really be a need for it, other than government agencies forbidding any use of certain algorithms.

Gregg Kellogg

> On Sep 12, 2023, at 8:59 PM, Ivan Herman <ivan@w3.org> wrote:
> 
> Hi Sebastian,
> 
> Thanks for this: it is indeed an important aspect that we did not discuss. Note that, while writing these lines, I realized that we already have a similar, but easier issue: it is not specified how the hash of the graph is returned, i.e., in what base it is encoded. That leads to similar issues as what you describe, doesn't it?
> 
> The reason this approach with parameterized hashing came up, and was voted in, is that it answers a real issue out there, so we should be careful not to dismiss it entirely. We should therefore try to solve, additionally, the issue you raise. Actually, you give an answer yourself… What if we say (details to be worked out):
> 
> 1. We agree that the hash function used to hash the result of the canonicalization (after all, our WG is RCH not RC) _MUST_ be the same as what is used for the internals of the algorithm
> 2. We agree that the result includes not only the hash value itself but also the hash function used to obtain it. I am not in the best position to decide which approach to use for that; one option, for example, is the SRI format[1], which essentially encodes the hash algorithm used together with the hash itself in base64
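
To make point 2 concrete, here is a minimal Python sketch of an SRI-style value: the algorithm name, a hyphen, then the base64 digest. This is an illustration only, not something from the spec or the resolution. The function names and the sample N-Quads line are made up for the example, and it assumes the algorithm name is one that Python's hashlib accepts.

```python
import base64
import hashlib


def sri_style_digest(data, algorithm="sha256"):
    """Return '<algorithm>-<base64 digest>' for the given bytes.

    Hypothetical helper: mirrors the shape of an SRI integrity value.
    """
    digest = hashlib.new(algorithm, data).digest()
    return algorithm + "-" + base64.b64encode(digest).decode("ascii")


def parse_sri_style_digest(value):
    """Split an SRI-style value into (algorithm name, raw digest bytes)."""
    algorithm, _, encoded = value.partition("-")
    return algorithm, base64.b64decode(encoded)


# Illustrative canonical N-Quads output (made up for this example).
canonical_nquads = b'_:c14n0 <http://example.org/p> "o" .\n'

labelled = sri_style_digest(canonical_nquads, "sha256")
algorithm, raw_digest = parse_sri_style_digest(labelled)
```

A receiver of such a value can read the algorithm from the prefix and recompute the digest directly, without having to guess which hash function was used.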
> 
> Would that work?
> 
> Ivan
> 
> 
> [1] https://w3c.github.io/webappsec-subresource-integrity/#the-integrity-attribute
> 
>> On 12 Sep 2023, at 23:03, Sebastian Crane <seabass-labrax@gmx.com> wrote:
>> 
>> Dear all,
>> 
>> Although I was unable to attend the RDF Canonicalization and Hashing
>> Working Group meeting this week, I did review the minutes and am
>> disappointed by the resolution to allow users of the standard to
>> customize the hashing algorithm. I believe this limits considerably the
>> utility of the rdf-canon specification by itself, effectively making it
>> useless without a wrapping data format for some stakeholders.
>> 
>> First of all, I would like to clarify that the change regards the
>> internal hashing algorithm used in rdf-canon. This exists merely to
>> disambiguate blank nodes, and doesn't protect data security. In our
>> case, as for the discussions about one hashing algorithm being more
>> secure than another, required by certain industry security standards or
>> the hypothetical situation that an algorithm is broken, these do not
>> correspond to any risk of loss of security or privacy.
>> 
>> What is lost, however, is utility of rdf-canon to the Semantic Web and
>> Linked Data ecosystem. The two general use-cases for data
>> canonicalisation are A: identifying duplicate datasets, and B:
>> creating a unique and consistent identifier of a dataset, from that
>> dataset itself.
>> 
>> In situation A, users already have access to both copies of the dataset,
>> and can therefore choose whatever internal hashing algorithm they
>> like. There is little importance to this decision, but critically, it
>> can be made without harming interoperability or other factors.
>> 
>> Situation B is what the recent resolution threatens. Because the choice
>> of internal hashing algorithm (again, used for disambiguation, not
>> security) affects the result, there are as many possible identifiers of
>> a dataset as there are hashing algorithms. Not such a unique value now!
>> 
>> If you receive a value described simply as a canonicalized hash of a
>> given RDF dataset, and would like to reproduce that value from the
>> dataset, you have no idea which hashing algorithm was used
>> internally. You must run a brute-force check for each possible hashing
>> algorithm that exists.
>> 
>> 
>> - It harms interoperability, as implementations will need to support
>> multiple internal hashing algorithms. Even 'fully conformant'
>> implementations may simply fail if they do not implement, for
>> instance, a newer hashing algorithm.
>> 
>> 
>> - It indirectly harms security, as these implementations will have a
>> larger attack surface area - not a risk specifically to do with hashing
>> as a computation, but simply because larger codebases have a greater
>> risk of security-critical vulnerabilities.
>> 
>> 
>> - It harms performance and energy-efficiency, because all the datasets'
>> blank nodes (a quantity often expressed in orders of magnitude) must be
>> hashed repeatedly with different algorithms.
>> 
>> 
>> - It harms ease of implementation, since some resource-constrained
>> devices simply do not have the available capacity to have tens of hash
>> algorithms installed. RDF is valuable in embedded and edge computing
>> contexts, but this resolution may jeopardise this use-case.
>> 
>> 
>> I hope it is clear that the change harms, to a lesser or greater extent,
>> almost every aspect of rdf-canon, in return for letting us avoid a
>> mostly arbitrary decision of what non-security-critical hashing
>> algorithm to use.
>> 
>> There are specifications such as the 'multibase' family of formats
>> which would allow users to annotate the hash values, addressing the
>> performance problem and most of the interoperability concern. However,
>> even this partial solution only works outside of rdf-canon; as I alluded
>> to earlier, it means that rdf-canon will become effectively useless for
>> my 'situation B' without a format like multibase to wrap the
>> result. Likewise, use of rdf-canon inside Verifiable Credentials may
>> address some of the issues due to the richer metadata that VCs
>> provide. This is metadata that does not need to exist, though, if we
>> simply make an informed decision and choose one hash algorithm.
>> 
>> I am more than happy to respond to any queries about the issues which I
>> have raised above. I believe that many of them have already been raised
>> by various participants in prior RDF Canonicalization and Hashing
>> Working Group meetings, but have been dismissed prematurely due to our
>> enthusiasm to enter our Candidate Recommendation stage.
>> 
>> What I would ask of the chairs and my fellow participants in the WG is
>> to consider the difficulties of industry and community fragmentation
>> that could potentially arise in the event that specific wrapper formats
>> and hash algorithms do not immediately become dominant among
>> stakeholders, and how we can minimise that risk ourselves by making
>> rdf-canon as well-specified as possible before entering CR.
>> 
>> Best wishes,
>> 
>> Sebastian
>> 
> 
> 
> ----
> Ivan Herman, W3C 
> Home: http://www.w3.org/People/Ivan/
> mobile: +33 6 52 46 00 43
> 
> 

Received on Wednesday, 13 September 2023 23:39:38 UTC