
Re: Thoughts on the LDS WG chartering discussion

From: Eric Prud'hommeaux <eric@w3.org>
Date: Tue, 15 Jun 2021 13:58:06 +0200
To: David Booth <david@dbooth.org>
Cc: semantic-web@w3.org
Message-ID: <20210615115806.GN8464@w3.org>
On Mon, Jun 14, 2021 at 12:33:05AM -0400, David Booth wrote:
> Hi Eric,
> 
> On 6/13/21 4:59 AM, Eric Prud'hommeaux wrote:
> > On Sat, Jun 12, 2021 at 10:10:35PM -0400, David Booth wrote:
> > > On 6/12/21 11:34 AM, Eric Prud'hommeaux wrote:
> > > > . . .
> > > > One could implement URDNA without ever serializing N-Quads with a
> > > > hashCode() function on Quad, BNode, IRI and Literal: . . .
> > > > [[
> > > > Hash Quad.urdna (h: Hasher) {
> > > >     return h.add(
> > > >       g.urdna(h), h(" "),
> > > >       s.urdna(h), h(" "),
> > > >       p.urdna(h), h(" "),
> > > >       o.urdna(h), h(" .\n")
> > > >     );
> > > > }
> > > > 
> > > > Hash IRI.urdna (h: Hasher) {
> > > >     return h.add('<', value, '>');
> > > > }
> > > > 
> > > > Hash BNode.urdna (h: Hasher) {
> > > >     static int ordinal = 1;
> > > >     return h.add("_:c14n_" + ordinal++);
> > > > }
> > > > ]]
> > > > 
> > > > Literal elided, and you'd want conditional code around Quad.g, but you
> > > > get the idea. Note that this approach means you never need to
> > > > duplicate your graph in memory as you do when constructing a
> > > > potentially giant N-Quads string.
> > > 
> > > If I'm understanding your intent, you are essentially showing that the hash
> > > can be computed on the fly as a serialization is produced piecemeal, without
> > > ever materializing the whole serialization at once.
> > 
> > That's correct. Such an optimization would allow you to operate in
> > considerably less memory (and exemplifies Aidan's point that this is
> > an operation over an abstract RDF graph).
> > 
> > 
> > > That could be done, but that is exactly what I would advise AGAINST doing,
> > > because in essence you are defining a home grown hashing algorithm, even if
> > > your algorithm makes use of an off-the-shelf hashing algorithm as a
> > > component.  From a security perspective, I think it is unwise to do that.
> > 
> > But it's NOT a home-grown hashing algorithm; it's URDNA (with BNode
> > ordering oversimplified for brevity). I was just pointing out that
> > URDNA starts with an abstract RDF graph and produces a consistent
> > ordering and BNode labeling, and finally, a hash.
> 
> I'm confused by your terminology, because AFAIK URDNA2015 is only the
> canonicalization algorithm -- not the hashing algorithm.
> https://json-ld.github.io/rdf-dataset-canonicalization/spec/index.html
> I think you meant the "RDF Dataset Hash (RDH)" algorithm?
> https://w3c.github.io/lds-wg-charter/index.html

You're right, I mis-used the term URDNA. It's more like an invocation
of SHA256.


> But in any case, I disagree with your claim that the algorithm that you
> sketched above is not a home-grown hashing algorithm.  *YOU* wrote the
> algorithm above: it is *not* an off-the-shelf hashing algorithm, even if it
> makes use of URDNA2015 and an off-the-shelf hashing algorithm as a
> parameter/component.

Maybe we're disagreeing on the scope of the algorithm. I'd consider
the SHA256 algorithm to be defined in
<https://datatracker.ietf.org/doc/html/rfc4634#section-6.2>. Whether
it's passed a giant string or is called incrementally is an
engineering decision.

Taking JS as an example, once you have canonicalized quads in a
Dataset, there are two obvious choices to get your hash:

1. Pass something like `stringstream` to the graph's abstract
   serialization functions to produce a stream, then pass that
   string to `crypto` to get a hash.

2. Pass `crypto`'s stream interface to those same serialization
   functions (bypassing the creation of a string).

They get the same answer, but the latter saves a step and is, I
believe, how most JS hackers would do it; no one likes to throw away
clock cycles, and duplicating memory could matter in embedded apps
like Web of Things devices.

Anyways, I just wanted to point out that you'll be using an
abstraction either to hash your sorted quads or to serialize
them to be hashed later.


> Vulnerabilities can be introduced even in seemingly simple variations of an
> algorithm.  I'm not claiming that your particular algorithm *has* a
> vulnerability, but I *am* claiming that we don't really know whether or not
> it has a vulnerability, because it has not been highly scrutinized by the
> world's crypto experts, and I don't think W3C should get involved in that
> question.  That's why I think it so important for this work to use an
> off-the-shelf hash algorithm to hash the fully materialized bytes of
> canonical N-Quads, which is essentially what the charter already proposes.
> Ivan views the canonical N-Quads as being equivalent to the canonicalized
> abstract RDF dataset, and I agree in theory, but in practice I think the
> distinction is *very* important *because* of the importance of using an
> unmodified off-the-shelf hashing algorithm on the fully *materialized*
> bytes.
> 
> > 
> > > > From a geek perspective, this abstract hashing function looks a little
> > > > arbitrary 'cause you could get just as strong a hash if you eliminated
> > > > all of the string constants (" ", "<", ">", ...). But keeping them is
> > > > cheap and allows you to serialize the result in a widely-supported RDF
> > > > syntax. RDF C14N, XML C14N and JWS's C14N all canonicalize to
> > > > serializable data, which makes it easier to implement and makes
> > > > users' lives way easier 'cause they can printf the input to the
> > > > hashing function.
> > > > 
> > > > I believe that describing hashing as a function over N-Quads suggests
> > > > that folks need to serialize as N-Quads to use it. I think wording
> > > > that describes it as a function over an (abstract) RDF Graph (Dataset,
> > > > whatever) would say exactly what I as an RDF user want to know.
> > > 
> > > I appreciate that desire, but I think defining the hash over the abstract
> > > RDF would raise security concerns that would be better avoided, because it
> > > inherently entangles the hashing algorithm with RDF-specific code.  I think
> > > a safer approach is to use an off-the-shelf cryptographic hashing algorithm
> > > on the fully materialized bytes of canonical N-Quads, which is essentially
> > > what the charter currently describes.
> > 
> > In the email that started this sub-sub-sub-thread, you said:
> > 
> > || But I think at least one part of the confusion and concern is that
> > || the proposed canonicalization is framed as an *abstract* RDF Dataset
> > || canonicalization.
> > 
> > I think that means we're wordsmithing the description of RDF Dataset
> > Hash [artfully bulleted by myself]:
> > [[
> > This specification details the processing steps of a hash function for
> > an arbitrary RDF Dataset. These steps include
> > 
> > (1) the generation of a canonical form using the algorithm specified
> >      in the “RDF Dataset Canonicalization” deliverable,
> > 
> > (2) sorting the N-Quads serialization of the canonical form generated
> >      in the previous step, and
> > 
> > (3) application of a cryptographic hash function on the sorted N-Quads
> >      representation.
> > ]]
> > I think this text already says what you want about N-Quads. Can you
> > say what you'd want to change?
> 
> I basically agree with the existing algorithmic approach, *because* it takes
> the hash of the fully materialized N-Quads.  The changes I suggest are:
> 
> 1. I think the N-Quads canonicalization algorithm needs to be a *named*
> deliverable (possibly under its own charter), so that it can be easily
> referenced and used for other purposes.

We have a deliverable for RDC; I don't think we need a deliverable for
serializing RDC in N-Quads.


> 2. I think that the diff/patch use case should be included in the Explainer,
> and the proposed canonicalization algorithm should be tested and evaluated
> against the diff/patch use case, to see how well it performs.  I don't
> expect any canonicalization algorithm to be perfect for diff/patch.  If it
> is "good enough", then that's fine with me.  But if it performs poorly for
> common diff/patch use cases, then I think improvements should at least be
> seriously attempted before standardization.

While RDC could be useful for verifying diffs, I'm not optimistic that
it's going to be great for generating diffs that anyone would want to
use. Diff algorithms are typically meant to provide concise deltas
when things are inserted or deleted. Anything good for
canonicalization must provide a deterministic ordering; inserting one
BNode can change the canonical label of every BNode that follows it.
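To illustrate with a toy sketch (mine, not URDNA2015's actual label assignment, which orders BNodes by hash; the effect on diffs is the same since labels are handed out in canonical order):

```javascript
// Toy sequential labeler, standing in for canonical BNode labeling.
function relabel(bnodes) {
  return bnodes.map((name, i) => [name, '_:c14n' + i]);
}

const before = relabel(['a', 'b', 'c']);
// Insert one new BNode ahead of 'b' in the canonical order...
const after = relabel(['a', 'inserted', 'b', 'c']);

// ...and 'b' and 'c' both get fresh labels (_:c14n1 -> _:c14n2, etc.),
// so a line-based diff of the serializations touches far more than
// the inserted triples.
const changed = before.filter(([name, label]) =>
  after.some(([n2, l2]) => n2 === name && l2 !== label));
console.log(changed.map(([name]) => name)); // 'b' and 'c' were relabeled
```

One inserted node near the front of the ordering renames everything after it, which is exactly the behavior that makes deltas balloon.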

I would be keen to see some research on this, but I'd be shocked if
researchers were able to discover and vet something fast enough to fit
in an acceptable time box.


> 3. I think the proposed hash should be characterized as being a hash of
> canonical N-Quads -- *not* a hash of the *abstract* RDF dataset.  I think it
> is slightly misleading (and causes confusion) to describe the hash as being
> a hash of the *abstract* RDF dataset.

I'm not seeing where you'd apply that change.


> > BTW, you referenced security concerns above. In order to make this more
> > concrete, I suggest that those security concerns would be:
> > 
> > SC1. (utility) does it get the same hash for any isomorphism of an RDF
> >       graph (regardless of triple/quad order)? (yes)
> > 
> > SC2. (safety) does every pair of non-isomorphic graphs give distinct
> >       hashes? (yes)
> > 
> > SC3. (safety) can one combine the hashes of subgraphs to force a
> >       specific hash value for a graph? (no)
> > 
> > I think that bullet (3) above answers SC3 without changes to the
> > charter. SC1 and SC2 require a deep dive into RDC.
> 
> I don't believe that W3C should be doing work that raises those questions.
> Those are questions that crypto experts should be asking about *generic*
> hash algorithms -- not RDF-specific algorithms.  That's why I think it is so
> important to cleanly separate the N-Quads canonicalization from the hash
> computation.

They seem pretty unavoidable to me. Great care has gone into RDC to
make it secure in these ways, but those Qs are what you'd want to ask
during standardization. Canonicalization is designed to make the
answer to SC1 be "yes"; otherwise it would have much less utility. If
distinct (non-isomorphic) graphs canonicalized to the same graph or if
triples could be added to obtain a specific hash, you'd be able to
attack RDF Signatures.


> Thanks,
> David Booth
> 
Received on Tuesday, 15 June 2021 11:58:25 UTC
