RE: Issues with parameterized hashing algorithms used internally

Thanks Sebastian, Ivan and Gregg,

This looks more than trivial in that we have general agreement that this is a real issue but there isn't immediate and full consensus on what to do about it. Satisfying this issue requires normative changes to the spec. That means we have two options:

1. Ignore the issue and carry on as we are with the spec we have.
2. Address the issue as quickly as we can and then restart the CR transition.

IMO, option 1 is not really an option and we need to expedite option 2.

I believe it would be difficult to hold a WG meeting next Tuesday 19th. I am not available and, with the RWOT event in Germany, others may be unavailable too - and we all need a break after TPAC. Therefore, I think our next possible meeting would be Wednesday 27th.

I have opened an issue for this in GitHub https://github.com/w3c/rdf-canon/issues/176.


@Sebastian, would you be able to offer a PR please? If you're able to do that by, say, the middle of next week, others will have a chance to have a look at it, perhaps offer amendments, and, I hope, we can run a meeting on 27th with an agenda along the lines of:

1. Review PR XXX (will fix Issue 176)
2. Review readiness of explainer and, if appropriate, resolve to publish
3. Review spec in the light of these two and, if appropriate, resolve (again) to seek transition to CR.

@Markus - do you agree?

Phil



---

Phil Archer
Web Solutions Director, GS1
https://www.gs1.org


https://philarcher.org

+44 (0)7887 767755
https://mastodon.social/@PhilA


-----Original Message-----
From: Gregg Kellogg <gregg@greggkellogg.com>
Sent: Thursday, September 14, 2023 12:39 AM
To: Ivan Herman <ivan@w3.org>
Cc: Sebastian Crane <seabass-labrax@gmx.com>; public-rch-wg@w3.org
Subject: Re: Issues with parameterized hashing algorithms used internally

I'm inclined to agree with Sebastian. It may be easy for people to hear "hash function" and think of threats, but this is only true when hashing the N-Quads result. Hashing within the algorithm is really for disambiguation, and I don't understand the threat model. I see no reason that the hash function used within the algorithm needs to be the same as the one used to hash the output.

However, I disagree that the parameterization itself hurts things; it is just that there may not really be a need for it, other than government agencies forbidding any use of certain algorithms.


Gregg Kellogg

Sent from my iPad


        On Sep 12, 2023, at 8:59 PM, Ivan Herman <ivan@w3.org> wrote:



        Hi Sebastian,

        thanks for this: it is indeed an important aspect that we did not discuss. Note that, while writing these lines, I realized that we already have a similar but simpler issue: it is not specified how the hash of the graph is returned, i.e., using what base encoding. That leads to issues similar to the ones you describe, doesn't it?

        The reason this approach with parameterized hashing came up, and got voted in, was that it answers a real issue out there, so we should be careful not to dismiss it entirely. Instead, we should try to solve, in addition, the issue you raise. Actually, you give an answer yourself… What if we say (details to be worked out):

        1. We agree that the hash function used to hash the result of the canonicalization (after all, our WG is RCH not RC) _MUST_ be the same as what is used for the internals of the algorithm
        2. We agree that the result does not consist of the hash value alone, but also records the hash function used to obtain it. I am not in the best position to decide which approach to use for that; one option would be, for example, the SRI format[1], which essentially encodes the hash algorithm used together with the hash itself in base64

        Would that work?

        Ivan


        [1] https://w3c.github.io/webappsec-subresource-integrity/#the-integrity-attribute
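The SRI-style encoding Ivan points to in [1] can be sketched as follows. This is a minimal illustration, not part of any spec text; `sri_digest` is a hypothetical helper name, and the input bytes are a toy stand-in for a canonical N-Quads document:

```python
import base64
import hashlib

def sri_digest(data: bytes, algorithm: str = "sha256") -> str:
    """Encode a digest in SRI style: '<algorithm>-<base64 digest>'."""
    digest = hashlib.new(algorithm, data).digest()
    return f"{algorithm}-{base64.b64encode(digest).decode('ascii')}"

# Example: tag the hash of a (toy) canonical N-Quads line.
print(sri_digest(b"_:c14n0 <http://example.org/p> \"o\" .\n"))
```

Because the algorithm name travels with the digest, a receiver can verify the value without guessing which hash function produced it.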




                On 12 Sep 2023, at 23:03, Sebastian Crane <seabass-labrax@gmx.com> wrote:

                Dear all,

                Although I was unable to attend the RDF Canonicalization and Hashing
                Working Group meeting this week, I did review the minutes and am
                disappointed by the resolution to allow users of the standard to
                customize the hashing algorithm. I believe this considerably limits
                the utility of the rdf-canon specification on its own, effectively
                making it useless without a wrapping data format for some stakeholders.

                First of all, I would like to clarify that the change concerns the
                internal hashing algorithm used in rdf-canon. This exists merely to
                disambiguate blank nodes, and does not protect data security. As a
                result, the discussions about one hashing algorithm being more secure
                than another, being required by certain industry security standards,
                or the hypothetical situation in which an algorithm is broken do not
                correspond to any actual risk of loss of security or privacy here.

                What is lost, however, is utility of rdf-canon to the Semantic Web and
                Linked Data ecosystem. The two general use-cases for data
                canonicalisation are A: identifying duplicate datasets, and B:
                creating a unique and consistent identifier of a dataset, from that
                dataset itself.

                In situation A, users already have access to both copies of the dataset,
                and can therefore choose whatever internal hashing algorithm they
                like. There is little importance to this decision, but critically, it
                can be made without harming interoperability or other factors.

                Situation B is what the recent resolution threatens. Because the choice
                of internal hashing algorithm (again, used for disambiguation, not
                security) affects the result, there are as many possible identifiers of
                a dataset as there are hashing algorithms. Not such a unique value now!

                If you receive a value described simply as a canonicalized hash of a
                given RDF dataset, and would like to reproduce that value from the
                dataset, you have no idea which hashing algorithm was used
                internally. You must run a brute-force check for each possible hashing
                algorithm that exists.
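Sebastian's verification problem can be sketched in a few lines. This is an editorial illustration, assuming a hypothetical `canonicalize_and_hash` function standing in for an rdf-canon implementation, and a small hypothetical list of candidate algorithms:

```python
import hashlib

CANDIDATES = ("sha256", "sha384", "sha512", "sha3_256")

def canonicalize_and_hash(nquads: str, algorithm: str) -> str:
    # Hypothetical stand-in for an rdf-canon implementation: a real one
    # would canonicalize the dataset (using `algorithm` internally for
    # blank-node disambiguation) before hashing. Here we simply hash
    # the input to show the shape of the verification problem.
    return hashlib.new(algorithm, nquads.encode()).hexdigest()

def guess_algorithm(nquads: str, received_digest: str):
    # A verifier given only a bare digest must try each candidate
    # algorithm in turn until one reproduces the received value.
    for alg in CANDIDATES:
        if canonicalize_and_hash(nquads, alg) == received_digest:
            return alg
    return None  # none of the known algorithms match
```

The loop grows with every algorithm an implementation might plausibly have used, which is exactly the brute-force cost described above.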


                - It harms interoperability, as implementations will need to support
                multiple internal hashing algorithms. Even 'fully conformant'
                implementations may simply fail if they do not implement, for
                instance, a newer hashing algorithm.


                - It indirectly harms security, as these implementations will have a
                larger attack surface area - not a risk specifically to do with hashing
                as a computation, but simply because larger codebases have a greater
                risk of security-critical vulnerabilities.


                - It harms performance and energy-efficiency, because all of a
                dataset's blank nodes (of which there may be a great many) must be
                hashed repeatedly with different algorithms.


                - It harms ease of implementation, since some resource-constrained
                devices simply do not have the available capacity to have tens of hash
                algorithms installed. RDF is valuable in embedded and edge computing
                contexts, but this resolution may jeopardise this use-case.


                I hope it is clear that the change harms, to a greater or lesser
                extent, almost every aspect of rdf-canon, in return for letting us
                avoid a mostly arbitrary decision about which non-security-critical
                hashing algorithm to use.

                There are specifications, such as the 'multibase' family of formats,
                which would allow users to annotate the hash values, addressing the
                performance problem and most of the interoperability concern. However,
                even this partial solution only works outside of rdf-canon; as I alluded
                to earlier, it means that rdf-canon will become effectively useless for
                my 'scenario B' without a format like multibase to wrap the
                result. Likewise, use of rdf-canon inside Verifiable Credentials may
                address some of the issues due to the richer metadata that VCs
                provide. This is metadata that does not need to exist, though, if we
                simply make an informed decision and choose one hash algorithm.
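The kind of annotation Sebastian alludes to can be sketched with a multihash-style tag (multihash is the member of the same multiformats family that identifies the hash function; the two codes below, 0x12 for sha2-256 and 0x13 for sha2-512, are from the multihash code table, while the helper names are hypothetical):

```python
import hashlib

# Codes from the multihash table: sha2-256 is 0x12, sha2-512 is 0x13.
CODES = {"sha256": 0x12, "sha512": 0x13}
NAMES = {code: name for name, code in CODES.items()}

def tag_digest(data: bytes, algorithm: str = "sha256") -> bytes:
    # Layout: <algorithm code><digest length><digest>. The receiver
    # reads the first byte instead of guessing the algorithm.
    digest = hashlib.new(algorithm, data).digest()
    return bytes([CODES[algorithm], len(digest)]) + digest

def read_algorithm(tagged: bytes) -> str:
    return NAMES[tagged[0]]
```

As the letter notes, this only helps outside rdf-canon: the wrapper carries the metadata that the bare hash value, on its own, does not.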

                I am more than happy to respond to any queries about the issues which I
                have raised above. I believe that many of them have already been raised
                by various participants in prior RDF Canonicalization and Hashing
                Working Group meetings, but have been dismissed prematurely due to our
                enthusiasm to enter our Candidate Recommendation stage.

                What I would ask of the chairs and my fellow participants in the WG is
                to consider the difficulties of industry and community fragmentation
                that could potentially arise in the event that specific wrapper formats
                and hash algorithms do not immediately become dominant among
                stakeholders, and how we can minimise that risk ourselves by making
                rdf-canon as well-specified as possible before entering CR.

                Best wishes,

                Sebastian





        ----
        Ivan Herman, W3C
        Home: http://www.w3.org/People/Ivan/

        mobile: +33 6 52 46 00 43


Received on Thursday, 14 September 2023 08:23:53 UTC