RE: Issues with parameterized hashing algorithms used internally

Hi Sebastian,

Would you be able to have a stab at writing a Pull Request to address the issue? If it can be done - and I emphasize 'if' - then it would be good to get the discussion well advanced before we meet next week.

WDYT?

Thanks

Phil

---

Phil Archer
Web Solutions Director, GS1
https://www.gs1.org
https://philarcher.org
+44 (0)7887 767755
https://mastodon.social/@PhilA


-----Original Message-----
From: Dan Yamamoto <dan@iij.ad.jp>
Sent: Thursday, September 14, 2023 4:25 PM
To: public-rch-wg@w3.org
Subject: Re: Issues with parameterized hashing algorithms used internally

Thank you, Sebastian and all.

Prompted by the discussion above, I've looked again, in greater detail, at what ultimately happens to canonicalization if the "internal" hash function becomes insecure. As I understand it, an insecure hash function can produce collisions, i.e., the same hash value for different inputs. This leads to ties in the for-loops that are ordered by the hash value's code point order, e.g., steps 4, 5, and 5.3 of the Canonicalization Algorithm (section 4.4.3). Such ties cannot be resolved deterministically, so the outcome becomes indeterminate.

While this may not pose a security threat in many use cases, it does matter for cases like VCs, where canonicalization is used as a preprocessing step for signing: if the result of canonicalization changes depending on the runtime environment or on the input blank node labels, verification of the resulting signature may succeed or fail unpredictably, which could compromise the correctness of the signature system. Therefore, I believe the internal hash function should be interchangeable. However, as others have suggested, I also think we need a mechanism to specify explicitly which hash function is used.
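
To make the tie concrete, here is a rough Python sketch; the quad strings and the hashing helper below are made up purely for illustration and are not the actual RDFC-1.0 serialization or algorithm steps.

import hashlib

# Toy illustration only: a stand-in for hashing each blank node's
# first-degree quads, NOT the real RDFC-1.0 serialization.
def bnode_hash(serialization, algorithm="sha256"):
    return hashlib.new(algorithm, serialization.encode("utf-8")).hexdigest()

first_degree = {
    "_:b0": '_:a <http://example.org/p> "x" .\n',
    "_:b1": '_:a <http://example.org/p> "y" .\n',
}

hashes = {bn: bnode_hash(quads) for bn, quads in first_degree.items()}

# The algorithm iterates over these hashes in code point order.  With a
# collision-resistant hash the two keys differ, so every implementation
# visits (and labels) the blank nodes in the same order.
for digest, bn in sorted((h, bn) for bn, h in hashes.items()):
    print(digest, bn)

# If the hash function were broken and both serializations produced the
# same digest, the hash alone could no longer separate _:b0 from _:b1;
# which node is processed first would then depend on the implementation
# or on the arbitrary input labels, which is the indeterminacy described
# above.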

Dan

On 2023/09/14 19:56, Pierre-Antoine Champin wrote:
> On 12/09/2023 23:03, Sebastian Crane wrote:
>> (...)
>>
>> What is lost, however, is the utility of rdf-canon to the Semantic
>> Web and Linked Data ecosystem. The two general use-cases for data
>> canonicalisation are A: identifying duplicate datasets, and B:
>> creating a unique and consistent identifier of a dataset, from that
>> dataset itself.
>>
>> (...)
>>
>> Situation B is what the recent resolution threatens. Because the
>> choice of internal hashing algorithm (again, used for disambiguation,
>> not security) affects the result, there are as many possible
>> identifiers of a dataset as there are hashing algorithms. Not such a
>> unique value now!
>
> I don't see how this is different from the situation we have with
> hash values used for checking the integrity of files.
>
> If you only give me a file and *some* hash value, it is useless. You
> also need to tell me which hash function you used to compute that hash.
>
> For RDF datasets, this is more complex, and was already more complex
> before this decision from the group: you needed to specify 1) which
> c14n function you had used (URDNA2015, RDFC-1.0) and 2) which hash
> function you used on the output.
>
> With the proposed change, we have one more moving part (which is meant
> as a feature, not a bug), but this does not qualitatively change the
> requirements to provide adequate metadata with a hash value to make it
> usable.
>
>> If you receive a value described simply as a canonicalized hash of a
>> given RDF dataset, and would like to reproduce that value from the
>> dataset, you have no idea which hashing algorithm was used
>> internally. You must run a brute-force check for each possible
>> hashing algorithm that exists.
>>
>>
>> - It harms interoperability, as implementations will need to support
>> multiple internal hashing algorithms. Even 'fully conformant'
>> implementations may simply fail to interoperate if they do not
>> implement, for instance, a newer hashing algorithm.
>
> If people that can't live with SHA-256 were to reinvent their own
> version of rdf-canon from scratch, this would hurt interoperability
> even more.
>
>> - It indirectly harms security, as these implementations will have a
>> larger attack surface area - not a risk specifically to do with
>> hashing as a computation, but simply because larger codebases have a
>> greater risk of security-critical vulnerabilities.
>
> My implementation, and I suspect others as well, relies on a library
> that provides a whole range of hash functions. This change does not
> significantly change the size of the codebase.
>
>> - It harms performance and energy-efficiency, because all the datasets'
>> blank nodes (a quantity often expressed in orders of magnitude) must
>> be hashed repeatedly with different algorithms.
>
> I don't get that argument. I expect that each application ecosystem
> will choose /one/ hash function that works for them.
>
>> - It harms ease of implementation, since some resource-constrained
>> devices simply do not have the available capacity to have tens of
>> hash algorithms installed.
>
> The spec only requires two.
>
>> RDF is valuable in embedded and edge computing contexts, but this
>> resolution may jeopardise this use-case.
>>
>>
>> I hope it is clear that the change harms, to a greater or lesser
>> extent, almost every aspect of rdf-canon, in return for letting us
>> avoid a mostly arbitrary decision about which non-security-critical
>> hashing algorithm to use.
>
> My feeling is that you are overemphasizing the harms.
>
> That being said, +1 to trying to prevent them, for example by:
>
> - Adding some text about the need to provide sufficient metadata with
> the final hash to make it usable (e.g. by using multibase).
> - Coining standard IRIs for identifying the standard "processing
> chains" from RDF dataset to hash value (i.e. "RDFC 1.0 using SHA-256,
> then hashed with SHA-256", "RDFC 1.0 using SHA-384, then hashed with
> SHA-384"), which external methods for conveying the metadata (e.g.
> VCs) could use.
> - Making SHA-256 the "official default" function, and adding some
> guidance about interoperability ("do not use another hash function
> unless strongly required to").
>
> My 2¢
>
>    pa
>
>> There are specifications, such as the 'multibase' family of formats,
>> which would allow users to annotate the hash values, addressing the
>> performance problem and most of the interoperability concern.
>> However, even this partial solution only works outside of rdf-canon;
>> as I alluded to earlier, it means that rdf-canon will become
>> effectively useless for my 'situation B' without a format like
>> multibase to wrap the result. Likewise, use of rdf-canon inside
>> Verifiable Credentials may address some of the issues due to the
>> richer metadata that VCs provide. This metadata does not need to
>> exist, though, if we simply make an informed decision and choose one
>> hash algorithm.
>>
>> I am more than happy to respond to any queries about the issues I
>> have raised above. I believe that many of them have already been
>> raised by various participants in prior RDF Canonicalization and
>> Hashing Working Group meetings, but have been dismissed prematurely
>> due to our enthusiasm to enter the Candidate Recommendation stage.
>>
>> What I would ask of the chairs and my fellow participants in the WG
>> is to consider the risk of industry and community fragmentation that
>> could arise if specific wrapper formats and hash algorithms do not
>> immediately become dominant among stakeholders, and how we can
>> minimise that risk ourselves by making rdf-canon as well-specified
>> as possible before entering CR.
>>
>> Best wishes,
>>
>> Sebastian
>>
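
As a concrete aside on the multibase suggestion above, here is a minimal Python sketch of a self-describing hash value, tagged so that a receiver knows which hash function produced it. The codec values are assumed from the multiformats registries as I read them (0x12 for sha2-256, 'f' for lowercase base16) and should be verified against those tables; the input is assumed to already be the canonical N-Quads output of RDFC-1.0.

import hashlib

SHA2_256_CODE = 0x12  # multihash code for sha2-256 (assumed; check the registry)

def self_describing_hash(canonical_nquads):
    # Hash the canonical N-Quads document produced by the c14n step.
    digest = hashlib.sha256(canonical_nquads.encode("utf-8")).digest()
    # Multihash layout: <hash-function code><digest length><digest bytes>.
    multihash = bytes([SHA2_256_CODE, len(digest)]) + digest
    # Multibase prefix 'f' marks lowercase base16 (hex) encoding.
    return "f" + multihash.hex()

print(self_describing_hash('<http://example.org/s> <http://example.org/p> "o" .\n'))

A receiver can strip the 'f', read the leading code, and know that the value should be reproduced with SHA-256 over the same canonical N-Quads document; in effect this names one of the "processing chains" Pierre-Antoine suggests.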


Received on Monday, 18 September 2023 15:15:20 UTC