- From: David Booth <david@dbooth.org>
- Date: Mon, 14 Jun 2021 00:33:05 -0400
- To: semantic-web@w3.org
Hi Eric,

On 6/13/21 4:59 AM, Eric Prud'hommeaux wrote:
> On Sat, Jun 12, 2021 at 10:10:35PM -0400, David Booth wrote:
>> On 6/12/21 11:34 AM, Eric Prud'hommeaux wrote:
>>> . . .
>>> One could implement URDNA without ever serializing N-Quads with a
>>> hashCode() function on Quad, BNode, IRI and Literal: . . .
>>> [[
>>> Hash Quad.urdna (h: Hasher) {
>>>   return h.add(
>>>     g.urdna(h), h(" "),
>>>     s.urdna(h), h(" "),
>>>     p.urdna(h), h(" "),
>>>     o.urdna(h), " .\n"
>>>   );
>>> }
>>>
>>> Hash IRI.urdna (h: Hasher) {
>>>   return h.add('<', value, '>');
>>> }
>>>
>>> Hash BNode.urdna (h: Hasher) {
>>>   static int ordinal = 1;
>>>   return h.add("_:c14n_" + ordinal++);
>>> }
>>> ]]
>>>
>>> Literal elided, and you'd want conditional code around Quad.g, but
>>> you get the idea. Note that this approach means you never need to
>>> duplicate your graph in memory as you do when constructing a
>>> potentially giant N-Quads string.
>>
>> If I'm understanding your intent, you are essentially showing that
>> the hash can be computed on the fly as a serialization is produced
>> piecemeal, without ever materializing the whole serialization at
>> once.
>
> That's correct. Such an optimization would allow you to operate in
> considerably less memory (and exemplifies Aidan's point that this is
> an operation over an abstract RDF graph).
>
>> That could be done, but that is exactly what I would advise AGAINST
>> doing, because in essence you are defining a home-grown hashing
>> algorithm, even if your algorithm makes use of an off-the-shelf
>> hashing algorithm as a component. From a security perspective, I
>> think it is unwise to do that.
>
> But it's NOT a home-grown hashing algorithm; it's URDNA (with BNode
> ordering oversimplified for brevity). I was just pointing out that
> URDNA starts with an abstract RDF graph and produces a consistent
> ordering and BNode labeling, and finally, a hash.
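(As a concrete aside -- this is my own illustrative sketch, not the URDNA2015 or RDH algorithm: feeding the serialized tokens to an off-the-shelf hasher piecewise, as your pseudocode does, yields the same digest as hashing the fully materialized N-Quads bytes, because both consume the same byte stream. The quads and labels below are made up for illustration.)

```python
import hashlib

# Hypothetical quads with already-assigned canonical blank node labels;
# these are illustrative only, not output of any real canonicalizer.
quads = [
    ("_:c14n0", "<http://example.org/p>", '"a"'),
    ("_:c14n1", "<http://example.org/p>", '"b"'),
]

# Streaming: feed each token to the hasher as it is produced,
# never materializing the full N-Quads document in memory.
streamed = hashlib.sha256()
for s, p, o in quads:
    for token in (s, " ", p, " ", o, " .\n"):
        streamed.update(token.encode("utf-8"))

# Materialized: build the whole N-Quads string, then hash it once.
doc = "".join(f"{s} {p} {o} .\n" for s, p, o in quads)
materialized = hashlib.sha256(doc.encode("utf-8"))

# Both approaches hash the same byte stream, so the digests agree.
assert streamed.hexdigest() == materialized.hexdigest()
```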
I'm confused by your terminology, because AFAIK URDNA2015 is only the
canonicalization algorithm -- not the hashing algorithm:
https://json-ld.github.io/rdf-dataset-canonicalization/spec/index.html

I think you meant the "RDF Dataset Hash (RDH)" algorithm?
https://w3c.github.io/lds-wg-charter/index.html

But in any case, I disagree with your claim that the algorithm you
sketched above is not a home-grown hashing algorithm. *YOU* wrote the
algorithm above: it is *not* an off-the-shelf hashing algorithm, even
if it makes use of URDNA2015 and an off-the-shelf hashing algorithm as
a parameter/component. Vulnerabilities can be introduced even in
seemingly simple variations of an algorithm. I'm not claiming that your
particular algorithm *has* a vulnerability, but I *am* claiming that we
don't really know whether or not it has one, because it has not been
highly scrutinized by the world's crypto experts, and I don't think W3C
should get involved in that question.

That's why I think it is so important for this work to use an
off-the-shelf hash algorithm to hash the fully materialized bytes of
canonical N-Quads, which is essentially what the charter already
proposes. Ivan views the canonical N-Quads as being equivalent to the
canonicalized abstract RDF dataset, and I agree in theory, but in
practice I think the distinction is *very* important *because* of the
importance of using an unmodified off-the-shelf hashing algorithm on
the fully *materialized* bytes.

>>> From a geek perspective, this abstract hashing function looks a
>>> little arbitrary 'cause you could get just as strong a hash if you
>>> eliminated all of the string constants (" ", "<", ">", ...). But
>>> keeping them is cheap and allows you to serialize the result in a
>>> widely-supported RDF syntax.
>>> RDF C14N, XML C14N and JWS's C14N all canonicalize to serializable
>>> data, which makes it easier to implement and makes users' lives way
>>> easier 'cause they can printf the input to the hashing function.
>>>
>>> I believe that describing hashing as a function over N-Quads
>>> suggests that folks need to serialize as N-Quads to use it. I think
>>> the wording describing it as a function over an (abstract) RDF Graph
>>> (Dataset, whatever) says exactly what I as an RDF user want to know.
>>
>> I appreciate that desire, but I think defining the hash over the
>> abstract RDF would raise security concerns that would be better
>> avoided, because it inherently entangles the hashing algorithm with
>> RDF-specific code. I think a safer approach is to use an
>> off-the-shelf cryptographic hashing algorithm on the fully
>> materialized bytes of canonical N-Quads, which is essentially what
>> the charter currently describes.
>
> In the email that started this sub-sub-sub-thread, you said:
>
> || But I think at least one part of the confusion and concern is that
> || the proposed canonicalization is framed as an *abstract* RDF
> || Dataset canonicalization."
>
> I think that means we're wordsmithing the description of RDF Dataset
> Hash [artfully bulleted by myself]:
> [[
> This specification details the processing steps of a hash function
> for an arbitrary RDF Dataset. These steps include
>
> (1) the generation of a canonical form using the algorithm specified
> in the "RDF Dataset Canonicalization" deliverable,
>
> (2) sorting the N-Quads serialization of the canonical form generated
> in the previous step, and
>
> (3) application of a cryptographic hash function on the sorted
> N-Quads representation.
> ]]
>
> I think this text already says what you want about N-Quads. Can you
> say what you'd want to change?

I basically agree with the existing algorithmic approach, *because* it
takes the hash of the fully materialized N-Quads. The changes I suggest
are:

1. I think the N-Quads canonicalization algorithm needs to be a *named*
deliverable (possibly under its own charter), so that it can be easily
referenced and used for other purposes.

2. I think that the diff/patch use case should be included in the
Explainer, and the proposed canonicalization algorithm should be tested
and evaluated against the diff/patch use case, to see how well it
performs. I don't expect any canonicalization algorithm to be perfect
for diff/patch. If it is "good enough", then that's fine with me. But
if it performs poorly for common diff/patch use cases, then I think
improvements should at least be seriously attempted before
standardization.

3. I think the proposed hash should be characterized as a hash of
canonical N-Quads -- *not* a hash of the *abstract* RDF dataset. I
think it is slightly misleading (and causes confusion) to describe the
hash as being a hash of the *abstract* RDF dataset.

> BTW, you referenced security concerns above. In order to make this
> more concrete, I suggest that those security concerns would be:
>
> SC1. (utility) does it get the same hash for any isomorphism of an
> RDF graph (regardless of triple/quad order)? (yes)
>
> SC2. (safety) does every pair of non-isomorphic graphs give distinct
> hashes? (yes)
>
> SC3. (safety) can one combine the hashes of subgraphs to force a
> specific hash value for a graph? (no)
>
> I think that bullet (3) above answers SC3 without changes to the
> charter. SC1 and SC2 require a deep dive into RDC.

I don't believe that W3C should be doing work that raises those
questions. Those are questions that crypto experts should be asking
about *generic* hash algorithms -- not RDF-specific algorithms. That's
why I think it is so important to cleanly separate the N-Quads
canonicalization from the hash computation.

Thanks,
David Booth
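P.S. For concreteness, steps (2) and (3) of the RDH outline quoted
above -- sort the canonical N-Quads lines, then apply an off-the-shelf
hash to the materialized bytes -- could be sketched roughly as below.
SHA-256 is my arbitrary choice here, the quads are made up, and step
(1), URDNA2015 canonicalization, is assumed to have already produced
the lines:

```python
import hashlib

def rdh_sketch(nquad_lines):
    """Illustrative sketch of steps (2) and (3) of the quoted RDH
    outline: sort the (already canonicalized) N-Quads lines, join them
    into one materialized document, and hash those bytes with an
    unmodified off-the-shelf algorithm (SHA-256 here)."""
    doc = "".join(sorted(nquad_lines))
    return hashlib.sha256(doc.encode("utf-8")).hexdigest()

# Two orderings of the same (hypothetical) canonicalized dataset ...
a = ['_:c14n0 <http://example.org/p> "x" .\n',
     '_:c14n1 <http://example.org/p> "y" .\n']
b = list(reversed(a))

# ... hash identically, because sorting removes the order dependence
# (Eric's SC1, for quad order at least).
assert rdh_sketch(a) == rdh_sketch(b)
```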
Received on Monday, 14 June 2021 04:33:50 UTC