- From: Pierre-Antoine Champin <pierre-antoine@w3.org>
- Date: Thu, 14 Sep 2023 12:56:59 +0200
- To: Sebastian Crane <seabass-labrax@gmx.com>, public-rch-wg@w3.org
- Message-ID: <e6c8571b-7116-1262-fae2-d37935a35528@w3.org>
On 12/09/2023 23:03, Sebastian Crane wrote:
> (...)
>
> What is lost, however, is the utility of rdf-canon to the Semantic Web and
> Linked Data ecosystem. The two general use-cases for data
> canonicalisation are A: identifying duplicate datasets, and B:
> creating a unique and consistent identifier of a dataset, from that
> dataset itself.
>
> (...)
>
> Situation B is what the recent resolution threatens. Because the choice
> of internal hashing algorithm (again, used for disambiguation, not
> security) affects the result, there are as many possible identifiers of
> a dataset as there are hashing algorithms. Not such a unique value now!

I don't see how this is different from the situation we have with hash
values for checking the integrity of files. If you only give me a file
and *some* hash value, it is useless. You also need to tell me which
hash function you used to compute that hash.

For RDF datasets, this is more complex, and was already more complex
before this decision from the group: you needed to specify 1) which c14n
function you had used (URDNA2015, RDFC-1.0) and 2) which hash function
you used on the output. With the proposed change, we have one more
moving part (which is meant as a feature, not a bug), but this does not
qualitatively change the requirement to provide adequate metadata with a
hash value to make it usable.

> If you receive a value described simply as a canonicalized hash of a
> given RDF dataset, and would like to reproduce that value from the
> dataset, you have no idea which hashing algorithm was used
> internally. You must run a brute-force check for each possible hashing
> algorithm that exists.
>
> - It harms interoperability, as implementations will need to support
> multiple internal hashing algorithms. Even 'fully conformant'
> implementations may simply fail to succeed if they do not implement, for
> instance, a newer hashing algorithm.

If the people who can't live with SHA-256 were to reinvent their own
version of rdf-canon from scratch, that would hurt interoperability even
more.

> - It indirectly harms security, as these implementations will have a
> larger attack surface area - not a risk specifically to do with hashing
> as a computation, but simply because larger codebases have a greater
> risk of security-critical vulnerabilities.

My implementation, and I suspect others as well, relies on a library
that provides a whole range of hash functions. This change does not
significantly change the size of the codebase.

> - It harms performance and energy-efficiency, because all the datasets'
> blank nodes (a quantity often expressed in orders of magnitude) must be
> hashed repeatedly with different algorithms.

I don't get that argument. I expect that each application ecosystem will
choose /one/ hash function that works for them.

> - It harms ease of implementation, since some resource-constrained
> devices simply do not have the available capacity to have tens of hash
> algorithms installed.

The spec only requires two.

> RDF is valuable in embedded and edge computing
> contexts, but this resolution may jeopardise this use-case.
>
> I hope it is clear that the change harms, to a lesser or greater extent,
> almost every aspect of rdf-canon, in return for letting us avoid a
> mostly arbitrary decision of what non-security-critical hashing
> algorithm to use.

My feeling is that you are overemphasizing the harms.
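To make the "moving parts" point above concrete, here is a minimal
sketch (Python, standard library only; the N-Quads literal is a
stand-in, not a real RDFC-1.0 output) of why a bare digest is useless
without metadata about how it was produced:

    import hashlib

    # Stand-in for the canonical N-Quads an RDFC-1.0 implementation would
    # emit for some dataset ("_:c14n0" is the label style RDFC-1.0 uses,
    # but this literal is illustrative, not an actual canonicalization result).
    canonical_nquads = '_:c14n0 <http://example.org/p> "o" .\n'

    data = canonical_nquads.encode("utf-8")
    print(hashlib.sha256(data).hexdigest())  # one possible "hash of the dataset"
    print(hashlib.sha384(data).hexdigest())  # a different value, same dataset

    # The internal hash is a further moving part: RDFC-1.0 run with SHA-256
    # vs SHA-384 can assign different canonical labels, so even the N-Quads
    # above would differ. Hence the three pieces of metadata: the c14n
    # algorithm, its internal hash, and the final hash.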
That being said, +1 to trying to prevent those harms, by (for example):

- Add some text about the need to provide sufficient metadata with the
  final hash to make it usable (e.g. by using multibase).
- Coin standard IRIs for identifying the standard "processing chains"
  from RDF dataset to hash value (i.e. "RDFC 1.0 using SHA-256, then
  hashed with SHA-256", "RDFC 1.0 using SHA-384, then hashed with
  SHA-384"), which external methods for conveying the metadata (e.g.
  VCs) could use.
- Make SHA-256 the "official default" function, and add some guidance
  about interoperability ("do not use another hash function unless
  strongly required to").

My 2¢

  pa

> There are specifications such as the 'multibase' family of formats
> which would allow users to annotate the hash values, addressing the
> performance problem and most of the interoperability concern. However,
> even this partial solution only works outside of rdf-canon; as I alluded
> to earlier, it means that rdf-canon will become effectively useless for
> my 'scenario B' without a format like multibase to wrap the
> result. Likewise, use of rdf-canon inside Verifiable Credentials may
> address some of the issues due to the richer metadata that VCs
> provide. This is metadata that does not need to exist, though, if we
> simply make an informed decision and choose one hash algorithm.
>
> I am more than happy to respond to any queries about the issues which I
> have raised above. I believe that many of them have already been raised
> by various participants in prior RDF Canonicalization and Hashing
> Working Group meetings, but have been dismissed prematurely due to our
> enthusiasm to enter our Candidate Recommendation stage.
>
> What I would ask of the chairs and my fellow participants in the WG is
> to consider the difficulties of industry and community fragmentation
> that could potentially arise in the event that specific wrapper formats
> and hash algorithms do not immediately become dominant among
> stakeholders, and how we can minimise that risk ourselves by making
> rdf-canon as well-specified as possible before entering CR.
>
> Best wishes,
>
> Sebastian
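PS: a small sketch of the multibase suggestion above: wrapping the final
digest in a multihash, then adding a multibase prefix, makes the value
itself say which (final) hash function was used. The 0x12 code for
sha2-256 and the "f" prefix for base16 are taken from the multiformats
tables as I understand them, so treat the details as illustrative rather
than normative:

    import hashlib

    def multibase_multihash_sha256(canonical_nquads: str) -> str:
        digest = hashlib.sha256(canonical_nquads.encode("utf-8")).digest()
        multihash = bytes([0x12, len(digest)]) + digest  # <hash code><length><digest>
        return "f" + multihash.hex()                     # "f" = multibase base16 prefix

    print(multibase_multihash_sha256('_:c14n0 <http://example.org/p> "o" .\n'))

    # This only identifies the *final* hash function; which c14n algorithm
    # and which internal hash produced the canonical N-Quads still has to be
    # conveyed separately, e.g. via the "processing chain" IRIs suggested above.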
Attachments
- application/pgp-keys attachment: OpenPGP public key
Received on Thursday, 14 September 2023 10:57:04 UTC