Issues with parameterized hashing algorithms used internally

Dear all,

Although I was unable to attend the RDF Canonicalization and Hashing
Working Group meeting this week, I did review the minutes and am
disappointed by the resolution to allow users of the standard to
customize the hashing algorithm. I believe this considerably limits the
utility of the rdf-canon specification on its own, effectively making it
useless for some stakeholders without a wrapping data format.

First of all, I would like to clarify that the change concerns the
internal hashing algorithm used by rdf-canon. This algorithm exists
merely to disambiguate blank nodes; it does not protect the security of
the data. The discussions about one hashing algorithm being more secure
than another, about algorithms required by certain industry security
standards, or about the hypothetical situation in which an algorithm is
broken therefore do not correspond to any actual risk to security or
privacy in our case.

What is lost, however, is the utility of rdf-canon to the Semantic Web
and Linked Data ecosystem. The two general use-cases for data
canonicalisation are A: identifying duplicate datasets, and B: creating
a unique and consistent identifier for a dataset from that dataset
itself.
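
To make use-case B concrete, it amounts to something like the following
sketch (Python, purely illustrative; 'canonicalize' stands in for a
hypothetical interface over an rdf-canon implementation that returns
canonical N-Quads as bytes):

    import hashlib

    def dataset_identifier(dataset, canonicalize) -> str:
        # 'canonicalize' is a stand-in for an rdf-canon implementation
        # that relabels blank nodes deterministically and returns
        # canonical N-Quads as bytes (hypothetical interface).
        canonical = canonicalize(dataset)
        # The identifier is derived from the dataset alone: any party
        # holding the same dataset should be able to recompute it.
        return hashlib.sha256(canonical).hexdigest()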

In situation A, users already have access to both copies of the dataset
and can therefore choose whatever internal hashing algorithm they
like. The decision carries little weight and, critically, it can be made
without harming interoperability or anything else.

Situation B is what the recent resolution threatens. Because the choice
of internal hashing algorithm (again, used for disambiguation, not
security) affects the result, there are as many possible identifiers of
a dataset as there are hashing algorithms. Not such a unique value now!

If you receive a value described simply as a canonicalized hash of a
given RDF dataset and would like to reproduce that value from the
dataset, you have no idea which hashing algorithm was used
internally. You must resort to a brute-force check, trying every
hashing algorithm in existence.
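
Here is a rough sketch of what that means in practice, again assuming a
hypothetical canonicalize() interface that takes the internal hash
algorithm as a parameter; the candidate list, and the simplifying
assumption that the same algorithm is used internally and for the final
digest, are mine purely for illustration:

    import hashlib

    # Hypothetical list of algorithms the receiver is willing to try.
    CANDIDATE_ALGORITHMS = ["sha256", "sha384", "sha512", "sha3_256"]

    def guess_internal_algorithm(dataset, received_hex, canonicalize):
        for name in CANDIDATE_ALGORITHMS:
            # The internal hash affects blank node labels, so
            # canonicalization must be re-run for every candidate.
            canonical = canonicalize(dataset, hash_algorithm=name)
            # Assume, for simplicity, that the same algorithm is also
            # used for the final digest over the canonical N-Quads.
            if hashlib.new(name, canonical).hexdigest() == received_hex:
                return name
        return None  # the algorithm used is unknown or unavailable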

Beyond the reproduction problem itself, parameterising the internal
hash does harm in several other respects:

- It harms interoperability, as implementations will need to support
multiple internal hashing algorithms. Even 'fully conformant'
implementations may simply fail to interoperate if they do not
implement, for instance, a newer hashing algorithm.

- It indirectly harms security, as these implementations will have a
larger attack surface: not a risk to do with hashing as a computation,
but simply because larger codebases carry a greater risk of
security-critical vulnerabilities.

- It harms performance and energy efficiency, because every blank node
in the dataset (a quantity that can be enormous) must be hashed
repeatedly, once per candidate algorithm.

- It harms ease of implementation, since some resource-constrained
devices simply do not have the capacity to ship tens of hash algorithm
implementations. RDF is valuable in embedded and edge computing
contexts, and this resolution may jeopardise that use-case.

I hope it is clear that the change harms, to a greater or lesser extent,
almost every aspect of rdf-canon, in return for letting us avoid a
mostly arbitrary decision about which non-security-critical hashing
algorithm to use.

There are specifications, such as the 'multibase' family of formats,
which would allow users to annotate the hash values, addressing the
performance problem and most of the interoperability concern. However,
even this partial solution only works outside of rdf-canon; as I
alluded to earlier, it means that rdf-canon becomes effectively useless
for my situation B without a format like multibase to wrap the result.
Likewise, use of rdf-canon inside Verifiable Credentials may address
some of the issues thanks to the richer metadata that VCs provide. That
metadata would not need to exist, though, if we simply made an informed
decision and chose one hash algorithm.
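
For illustration, a multihash/multibase-style wrapper makes the value
self-describing; the specific codes below (0x12 for sha2-256 in the
multicodec table, 'f' for lowercase base16 in multibase) are my reading
of those specifications and should be double-checked, and the canonical
N-Quads line is a placeholder:

    import hashlib

    canonical_nquads = b'_:c14n0 <http://example.org/p> "o" .\n'  # placeholder

    digest = hashlib.sha256(canonical_nquads).digest()
    # multihash layout: <hash function code><digest length><digest bytes>
    self_describing = bytes([0x12, len(digest)]) + digest
    # multibase: a single leading character names the base encoding
    annotated = "f" + self_describing.hex()

    print(annotated)

A consumer can then recover the algorithm from the value itself rather
than brute-forcing every hash function, but only because the wrapper
carries the metadata that rdf-canon itself declines to pin down.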

I am more than happy to respond to any queries about the issues I have
raised above. I believe that many of them have already been raised by
various participants in prior RDF Canonicalization and Hashing Working
Group meetings, but were dismissed prematurely in our enthusiasm to
reach the Candidate Recommendation stage.

What I would ask of the chairs and my fellow participants in the WG is
to consider the industry and community fragmentation that could arise
if specific wrapper formats and hash algorithms do not quickly become
dominant among stakeholders, and to consider how we can minimise that
risk ourselves by making rdf-canon as well-specified as possible before
entering CR.

Best wishes,

Sebastian
