Re: Issues with parameterized hashing algorithms used internally

Thank you, Sebastian and all.

Through the above discussion, I've reconsidered in greater detail what 
ultimately happens to canonicalization if the "internal" hash function 
becomes insecure. As I understand it, an insecure hash function can 
produce collisions, i.e., output the same hash value for different 
inputs. This leads to ties in the for-loops that iterate over hash 
values in code point order, e.g., steps 4, 5, and 5.3 of the 
Canonicalization Algorithm (section 4.4.3). Such ties cannot be 
resolved deterministically, making the outcome indeterminate.
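
To make this concrete, here is a minimal Python sketch (not the spec's 
algorithm; the hash values and node labels below are invented) of the 
kind of tie a colliding internal hash produces:

    # Hypothetical result of step 3 ("Hash First Degree Quads"): a map
    # from each hash value to the blank nodes that produced it. With a
    # broken hash, distinct nodes can end up under a single entry.
    hash_to_bnodes = {
        "00af": ["_:b0"],
        "3c7e": ["_:b1", "_:b2"],  # a collision
    }

    # Steps 4 and 5 iterate over the hashes in code point order; an
    # entry listing more than one node is a tie that the hash value
    # itself cannot break.
    for h in sorted(hash_to_bnodes):
        if len(hash_to_bnodes[h]) > 1:
            print(f"tie on {h}: order of {hash_to_bnodes[h]} is undefined")

In the real algorithm, step 5 escalates such ties to "Hash N-Degree 
Quads", but if that hash collides as well, the sort in step 5.3 is 
equally ambiguous.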

While this might not pose a security threat in many use cases, it does 
matter in cases like VC, where canonicalization is used as a 
preprocessing step for signing: if the result of canonicalization 
changes with the runtime environment or with the input blank node 
labels, a signature that verifies in one setting may fail to verify in 
another. This would compromise the correctness of the signature system. 
Therefore, I believe the internal hash function should be 
interchangeable; at the same time, as others have suggested, we need a 
mechanism that explicitly specifies which hash function is used.
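
For illustration only (HMAC stands in for a real VC signature suite, 
and the two canonical forms are contrived), even a one-line divergence 
in the canonical N-Quads breaks verification:

    import hashlib
    import hmac

    SECRET = b"shared-signing-key"  # stand-in for a real signing key

    def sign(canonical_nquads):
        return hmac.new(SECRET, canonical_nquads.encode(),
                        hashlib.sha256).hexdigest()

    # Two canonical forms of the same dataset, differing only in how a
    # tie between two blank nodes was broken:
    issuer_form = "_:c14n0 <urn:example:p> _:c14n1 .\n"
    verifier_form = "_:c14n1 <urn:example:p> _:c14n0 .\n"

    assert sign(issuer_form) != sign(verifier_form)  # verification fails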

Dan

On 2023/09/14 19:56, Pierre-Antoine Champin wrote:
> On 12/09/2023 23:03, Sebastian Crane wrote:
>> (...)
>>
>> What is lost, however, is the utility of rdf-canon to the Semantic Web and
>> Linked Data ecosystem. The two general use-cases for data
>> canonicalisation are A: identifying duplicate datasets, and B:
>> creating a unique and consistent identifier of a dataset, from that
>> dataset itself.
>>
>> (...)
>>
>> Situation B is what the recent resolution threatens. Because the choice
>> of internal hashing algorithm (again, used for disambiguation, not
>> security) affects the result, there are as many possible identifiers of
>> a dataset as there are hashing algorithms. Not such a unique value now!
> 
> I don't see how this is different from the situation we have with hash 
> values for checking the integrity of files.
> 
> If you only give me a file and *some* hash value, it is useless. You 
> also need to tell me which hash function you used to compute that hash.
> 
> For RDF datasets, this is more complex, and was already more complex 
> before this decision from the group: you needed to specify 1) which c14n 
> function you had used (URDNA2015, RDFC-1.0) and 2) which hash function 
> you used on the output.
> 
> With the proposed change, we have one more moving part (which is meant 
> as a feature, not a bug), but this does not qualitatively change the 
> requirements to provide adequate metadata with a hash value to make it 
> usable.
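> 
> As a rough sketch (the labels below are mine, not from any spec), 
> that metadata is just three names travelling with the value:
> 
>     import hashlib
> 
>     def dataset_hash(canonical_nquads, final_hash="sha256"):
>         digest = hashlib.new(final_hash,
>                              canonical_nquads.encode()).hexdigest()
>         return {
>             "c14n": "RDFC-1.0",         # which c14n function
>             "internal_hash": "sha256",  # its internal hash
>             "final_hash": final_hash,   # hash applied to the output
>             "digest": digest,
>         }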
> 
>> If you receive a value described simply as a canonicalized hash of a
>> given RDF dataset, and would like to reproduce that value from the
>> dataset, you have no idea which hashing algorithm was used
>> internally. You must run a brute-force check for each possible hashing
>> algorithm that exists.
>>
>>
>> - It harms interoperability, as implementations will need to support
>> multiple internal hashing algorithms. Even 'fully conformant'
>> implementations may simply fail to interoperate if they do not
>> implement, for instance, a newer hashing algorithm.
> 
> If people who can't live with SHA-256 were to reinvent their own 
> version of rdf-canon from scratch, this would hurt interoperability even 
> more.
> 
>> - It indirectly harms security, as these implementations will have a
>> larger attack surface area - not a risk specifically to do with hashing
>> as a computation, but simply because larger codebases have a greater
>> risk of security-critical vulnerabilities.
> 
> My implementation, and I suspect others as well, relies on a library 
> that provides a whole range of hash functions. This change does not 
> significantly increase the size of the codebase.
> 
>> - It harms performance and energy-efficiency, because all the datasets'
>> blank nodes (a quantity often expressed in orders of magnitude) must be
>> hashed repeatedly with different algorithms.
> I don't get that argument. I expect that each application ecosystem will 
> choose /one/ hash function that works for them.
>> - It harms ease of implementation, since some resource-constrained
>> devices simply do not have the available capacity to have tens of hash
>> algorithms installed.
> The spec only requires two.
>> RDF is valuable in embedded and edge computing
>> contexts, but this resolution may jeopardise this use-case.
>>
>>
>> I hope it is clear that the change harms, to a lesser or greater extent,
>> almost every aspect of rdf-canon, in return for letting us avoid a
>> mostly arbitrary decision of what non-security-critical hashing
>> algorithm to use.
> 
> My feeling is that you are overemphasizing the harms.
> 
> That being said, +1 to try and prevent them, by (for example):
> 
> - Add some text about the need to provide sufficient metadata with the 
> final hash to make it usable (e.g. by using multibase; see the sketch 
> after this list).
> - Coin standard IRIs for identifying the standard "processing chains" 
> from RDF dataset to hash value (e.g. "RDFC 1.0 using SHA-256, then 
> hashed with SHA-256", "RDFC 1.0 using SHA-384, then hashed with 
> SHA-384"), which external methods for conveying the metadata (e.g. VC) 
> could use.
> - Make SHA-256 the "official default" function, and add some guidance 
> about interoperability ("do not use another hash function unless 
> strongly required to").
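> 
> For instance (illustrative only: the chain IRI below is invented, and 
> only the "f" base16 prefix comes from the multibase spec):
> 
>     import hashlib
> 
>     # Hypothetical coined IRI for "RDFC 1.0 using SHA-256, then
>     # hashed with SHA-256":
>     CHAIN = "https://example.org/rdfc-1.0-sha256-sha256"
> 
>     digest = hashlib.sha256(b"...canonical N-Quads...").hexdigest()
>     annotated = {
>         "chain": CHAIN,
>         "value": "f" + digest,  # multibase "f" = base16 (lowercase)
>     }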
> 
> My 2¢
> 
>    pa
> 
>> There are specifications, such as the 'multibase' family of formats,
>> which would allow users to annotate the hash values, addressing the
>> performance problem and most of the interoperability concern. However,
>> even this partial solution only works outside of rdf-canon; as I alluded
>> to earlier, it means that rdf-canon will become effectively useless for
>> my 'scenario B' without a format like multibase to wrap the
>> result. Likewise, use of rdf-canon inside Verifiable Credentials may
>> address some of the issues due to the richer metadata that VCs
>> provide. This is metadata that does not need to exist, though, if we
>> simply make an informed decision and choose one hash algorithm.
>>
>> I am more than happy to respond to any queries about the issues which I
>> have raised above. I believe that many of them have already been raised
>> by various participants in prior RDF Canonicalization and Hashing
>> Working Group meetings, but have been dismissed prematurely due to our
>> enthusiasm to enter the Candidate Recommendation stage.
>>
>> What I would ask of the chairs and my fellow participants in the WG is
>> to consider the difficulties of industry and community fragmentation
>> that could potentially arise in the event that specific wrapper formats
>> and hash algorithms do not immediately become dominant among
>> stakeholders, and how we can minimise that risk ourselves by making
>> rdf-canon as well-specified as possible before entering CR.
>>
>> Best wishes,
>>
>> Sebastian
>>

Received on Thursday, 14 September 2023 15:24:50 UTC