Re: Thoughts on the LDS WG chartering discussion

Hi Eric,

Summary:

  - I do not believe that we could ever come close to reaching the same 
level of confidence in the security of a combined RDF 
canonicalize-and-hash algorithm as we could if the N-Quads 
canonicalization and hashing were cleanly separated.

  - Given that the proposed canonicalize-and-hash algorithm would 
already specify how to compute an N-Quads canonicalization as an 
intermediate product, it would be a pointless waste not to give the 
N-Quads canonicalization algorithm an official name (as a deliverable), 
to enable it to be easily referenced and used in other contexts.

Details below . . .

On 6/15/21 7:58 AM, Eric Prud'hommeaux wrote:
> On Mon, Jun 14, 2021 at 12:33:05AM -0400, David Booth wrote:
>> Hi Eric,
>>
>> On 6/13/21 4:59 AM, Eric Prud'hommeaux wrote:
>>> On Sat, Jun 12, 2021 at 10:10:35PM -0400, David Booth wrote:
>>>> On 6/12/21 11:34 AM, Eric Prud'hommeaux wrote:
>>>>> . . .
>>>>> One could implement URDNA without ever serializing N-Quads with a
>>>>> hashCode() function on Quad, BNode, IRI and Literal: . . .
>>>>> [[
>>>>> Hash Quad.urdna (h: Hasher) {
>>>>>      return h.add(
>>>>>        g.urdna(h), h(" "),
>>>>>        s.urdna(h), h(" "),
>>>>>        p.urdna(h), h(" "),
>>>>>        o.urdna(h), " .\n"
>>>>>      );
>>>>> }
>>>>>
>>>>> Hash IRI.urdna (h: Hasher) {
>>>>>      return h.add('<', value, '>');
>>>>> }
>>>>>
>>>>> Hash BNode.urdna (h: Hasher) {
>>>>>      static int ordinal = 1;
>>>>>      return h.add("_:c14n_" + ordinal++);
>>>>> }
>>>>> ]]
>>>>>
>>>>> Literal elided, and you'd want conditional code around Quad.g, but you
>>>>> get the idea. Note that this approach means you never need to
>>>>> duplicate your graph in memory as you do when constructing a
>>>>> potentially giant N-Quads string.
>>>>
>>>> If I'm understanding your intent, you are essentially showing that the hash
>>>> can be computed on the fly as a serialization is produced piecemeal, without
>>>> ever materializing the whole serialization at once.
>>>
>>> That's correct. Such an optimization would allow you to operate in
>>> considerably less memory (and exemplifies Aidan's point that this is
>>> an operation over an abstract RDF graph).
>>>
>>>
>>>> That could be done, but that is exactly what I would advise AGAINST doing,
>>>> because in essence you are defining a home grown hashing algorithm, even if
>>>> your algorithm makes use of an off-the-shelf hashing algorithm as a
>>>> component.  From a security perspective, I think it is unwise to do that.
>>>
>>> But it's NOT a home-grown hashing algorithm; it's URDNA (with BNode
>>> ordering oversimplified for brevity). I was just pointing out that
>>> URDNA starts with an abstract RDF graph and produces a consistent
>>> ordering and BNode labeling, and finally, a hash.
>>
>> I'm confused by your terminology, because AFAIK URDNA2015 is only the
>> canonicalization algorithm -- not the hashing algorithm.
>> https://json-ld.github.io/rdf-dataset-canonicalization/spec/index.html
>> I think you meant the "RDF Dataset Hash (RDH)" algorithm?
>> https://w3c.github.io/lds-wg-charter/index.html
> 
> You're right, I mis-used the term URDNA. It's more like an invocation
> of SHA256.
> 
> 
>> But in any case, I disagree with your claim that the algorithm that you
>> sketched above is not a home-grown hashing algorithm.  *YOU* wrote the
>> algorithm above: it is *not* an off-the-shelf hashing algorithm, even if it
>> makes use of URDNA2015 and an off-the-shelf hashing algorithm as a
>> parameter/component.
> 
> Maybe we're disagreeing on the scope of the algorithm. I'd consider
> the SHA256 algorithm to be defined in
> <https://datatracker.ietf.org/doc/html/rfc4634#section-6.2>. Whether
> it's passed a giant string or is called incrementally is an
> engineering decision.
> 
> Taking JS as an example, once you have canonicalized quads in a
> Dataset, there are two obvious choices to get your hash:
> 
> 1. Pass something like `stringstream` to the graph's abstract
>     serialization functions to produce a string, then pass that
>     string to `crypto` to get a hash.
> 
> 2. `crypto` stream interface to those same serialization functions
>     (bypassing the creation of a string).
> 
> They get the same answer, but the latter saves a step and is, I
> believe, how most JS hackers would do it; no one likes to throw away
> clock cycles and duplicating memory could matter to embedded apps like
> on Web Of Things things.

Agreed, and I definitely appreciate the efficiency motivation for doing 
that, but . . . (see below)
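
For concreteness, here is my own rough TypeScript (Node.js) sketch of 
those two options.  The `canonicalQuads` array is hypothetical, standing 
in for the already-canonicalized, sorted quads; nothing below is spec 
text:

[[
import { createHash } from 'crypto';

declare const canonicalQuads: string[];  // canonical, sorted N-Quad lines

// Option 1: materialize the full N-Quads document, then hash it.
const nquads = canonicalQuads.join('');  // one potentially large string
const hash1 = createHash('sha256').update(nquads, 'utf8').digest('hex');

// Option 2: stream each line into the hash, never building the big string.
const hasher = createHash('sha256');
for (const line of canonicalQuads) {
  hasher.update(line, 'utf8');
}
const hash2 = hasher.digest('hex');

// hash1 === hash2; option 2 just skips the intermediate copy.
]]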

> 
> Anyways, I just wanted to point out that you'll be using an
> abstraction either to hash your sorted quads or to serialize
> them to be hashed later.

Agreed, but . . . (see below)

> 
> 
>> Vulnerabilities can be introduced even in seemingly simple variations of an
>> algorithm.  I'm not claiming that your particular algorithm *has* a
>> vulnerability, but I *am* claiming that we don't really know whether or not
>> it has a vulnerability, because it has not been highly scrutinized by the
>> world's crypto experts, and I don't think W3C should get involved in that
>> question.  That's why I think it so important for this work to use an
>> off-the-shelf hash algorithm to hash the fully materialized bytes of
>> canonical N-Quads, which is essentially what the charter already proposes.
>> Ivan views the canonical N-Quads as being equivalent to the canonicalized
>> abstract RDF dataset, and I agree in theory, but in practice I think the
>> distinction is *very* important *because* of the importance of using an
>> unmodified off-the-shelf hashing algorithm on the fully *materialized*
>> bytes.
>>
>>>
>>>>> From a geek perspective, this abstract hashing function looks a little
>>>>> arbitrary 'cause you could get just as strong a hash if you eliminated
>>>>> all of the string constants (" ", "<", ">", ...). But keeping them is
>>>>> cheap and allows you to serialize the result in a widely-supported RDF
>>>>> syntax. RDF C14N, XML C14N and JWS's C14N all canonicalize to
>>>>> serializable data, which makes it easier to implement and makes
>>>>> users' lives way easier 'cause they can printf the input to the
>>>>> hashing function.
>>>>>
>>>>> I believe that describing hashing as a function over N-Quads suggests
>>>>> that folks need to serialize as N-Quads to use it. I think the wording
>>>>> describing it as a function over an (abstract) RDF Graph (Dataset,
>>>>> whatever) says exactly what I as an RDF user want to know.
>>>>
>>>> I appreciate that desire, but I think defining the hash over the abstract
>>>> RDF would raise security concerns that would be better avoided, because it
>>>> inherently entangles the hashing algorithm with RDF-specific code.  I think
>>>> a safer approach is to use an off-the-shelf cryptographic hashing algorithm
>>>> on the fully materialized bytes of canonical N-Quads, which is essentially
>>>> what the charter currently describes.
>>>
>>> In the email that started this sub-sub-sub-thread, you said:
>>>
>>> || But I think at least one part of the confusion and concern is that
>>> || the proposed canonicalization is framed as an *abstract* RDF Dataset
>>> || canonicalization."
>>>
>>> I think that means we're wordsmithing the description of RDF Dataset
>>> Hash [artfully bulleted by myself]:
>>> [[
>>> This specification details the processing steps of a hash function for
>>> an arbitrary RDF Dataset. These steps include
>>>
>>> (1) the generation of a canonical form using the algorithm specified
>>>       in the “RDF Dataset Canonicalization” deliverable,
>>>
>>> (2) sorting the N-Quads serialization of the canonical form generated
>>>       in the previous step, and
>>>
>>> (3) application of a cryptographic hash function on the sorted N-Quads
>>>       representation.
>>> ]]
>>> I think this text already says what you want about N-Quads. Can you
>>> say what you'd want to change?
>>
>> I basically agree with the existing algorithmic approach, *because* it takes
>> the hash of the fully materialized N-Quads.  The changes I suggest are:
>>
>> 1. I think the N-Quads canonicalization algorithm needs to be a *named*
>> deliverable (possibly under its own charter), so that it can be easily
>> referenced and used for other purposes.
> 
> We have a deliverable for RDC; I don't think we need a deliverable for
> serializing RDC in N-Quads.

I completely disagree on that point.  The proposed RDF Dataset hashing 
algorithm *uses* a canonical N-Quads serialization, but simply treats it 
as an intermediate result whose only purpose is to be fed into the 
hashing function.  But a standard canonical N-Quads serialization has 
value in its own right, outside of digital signatures.  It would be a 
pointless waste to define N-Quads canonicalization, but never actually 
give it a name that can be easily referenced for other purposes.
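
To make the separation I'm advocating concrete, here is my own rough 
sketch (hypothetical function names, not proposed spec text) of how a 
*named* canonical N-Quads serialization would be used:

[[
import { createHash } from 'crypto';
import { writeFileSync } from 'fs';

// The separately named deliverable: abstract RDF Dataset in,
// canonical N-Quads text out.  (The name is made up for illustration.)
declare function canonicalNQuads(dataset: unknown): string;
declare const dataset: unknown;

const nquads = canonicalNQuads(dataset);  // fully materialized text

// Use 1 (signatures): unmodified off-the-shelf hash over those bytes.
const hash = createHash('sha256').update(nquads, 'utf8').digest('hex');

// Use 2 (everything else): the same canonical text can be written out,
// cached, or compared across versions with an ordinary line-based diff.
writeFileSync('dataset.canonical.nq', nquads);
]]

The first step -- the canonicalNQuads function -- is useful entirely on 
its own, which is why it deserves a name that other specifications and 
tools can cite.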

> 
>> 2. I think that the diff/patch use case should be included in the Explainer,
>> and the proposed canonicalization algorithm should be tested and evaluated
>> against the diff/patch use case, to see how well it performs.  I don't
>> expect any canonicalization algorithm to be perfect for diff/patch.  If it
>> is "good enough", then that's fine with me.  But if it performs poorly for
>> common diff/patch use cases, then I think improvements should at least be
>> seriously attempted before standardization.
> 
> While RDC could be useful for verifying diffs, I'm not optimistic that
> it's going to be great for generating diffs that anyone would want to
> use. Diff algorithms are typically meant to provide concise deltas
> when things are inserted or deleted. Anything good for
> canonicalization must provide a standard ordering; an inserted BNode
> would change every following BNode.
> 
> I would be keen to see some research on this, but I'd be shocked if
> researchers were able to discover and vet something fast enough to fit
> in an acceptable time box.

I accept that it may turn out to be too difficult to make significant 
improvements to the existing algorithm, for the diff/patch use case.  My 
point is that the algorithm should at least be *evaluated* for this use 
case, to find out: 1. whether it is already "good enough"; and if not, 
2. whether it could be easily improved.  I think the diff/patch use case 
is too important to be ignored or dismissed out of hand.

Furthermore, even the digital signatures use case would benefit from 
using the *same* N-Quads canonicalization algorithm and implementations 
that are used for other use cases, because it would give them broader 
use and scrutiny.

> 
> 
>> 3. I think the proposed hash should be characterized as being a hash of
>> canonical N-Quads -- *not* a hash of the *abstract* RDF dataset.  I think it
>> is slightly misleading (and causes confusion) to describe the hash as being
>> a hash of the *abstract* RDF dataset.
> 
> I'm not seeing where you'd apply that change.

The "RDF Dataset Hash (RDH)" section of the proposed charter says, for 
example: "This specification details the processing steps of a hash 
function for an arbitrary RDF Dataset".
https://w3c.github.io/lds-wg-charter/index.html

> 
> 
>>> BTW, you referenced security concerns above. In order to make this more
>>> concrete, I suggest that those security concerns would be:
>>>
>>> SC1. (utility) does it get the same hash for any isomorphism of an RDF
>>>        graph (regardless of triple/quad order)? (yes)
>>>
>>> SC2. (safety) does every pair of non-isomorphic graphs give distinct
>>>        hashes? (yes)
>>>
>>> SC3. (safety) can one combine the hashes of subgraphs to force a
>>>        specific hash value for a graph? (no)
>>>
>>> I think that bullet (3) above answers SC3 without changes to the
>>> charter. SC1 and SC2 require a deep dive into RDC.
>>
>> I don't believe that W3C should be doing work that raises those questions.
>> Those are questions that crypto experts should be asking about *generic*
>> hash algorithms -- not RDF-specific algorithms.  That's why I think it is so
>> important to cleanly separate the N-Quads canonicalization from the hash
>> computation.
> 
> They seem pretty unavoidable to me. Great care has gone into RDC to
> make it secure in these ways, but those Qs are what you'd want to ask
> during standardization. Canonicalization is designed to make the
> answer to SC1 be "yes"; otherwise it would have much less utility. If
> distinct (non-isomorphic) graphs canonicalized to the same graph or if
> triples could be added to obtain a specific hash, you'd be able to
> attack RDF Signatures.

Any mistake in the canonicalization algorithm could cause a 
vulnerability in the hash result.  That is true whether a combined 
canonicalize-and-hash algorithm is used, or the canonicalization and 
hashing are performed as separate steps.  And of course, it isn't just 
the specification of each algorithm that matters, but also its 
*implementation*.

But a mistake in a combined canonicalize-and-hash function will be much 
harder to detect than if separate functions are used.  A combined RDF 
canonicalize-and-hash algorithm/implementation will *only* be 
scrutinized by some RDF geeks and a very few crypto geeks, while a 
generic hash algorithm/implementation will be scrutinized by the world's 
experts.

Suppose there is a mistake in the canonicalization algorithm or 
implementation.  How would a user even notice the problem, if a combined 
RDF canonicalize-and-hash implementation were producing wrong results? 
The resulting hash looks like a random string anyway.  A human isn't 
going to notice if, say, 1 in 10,000 inputs produces the same 
random-looking string -- i.e., a hash collision -- but 1 in 10,000 would 
be far too many collisions for the algorithm to be considered secure.

In contrast, if the RDF canonicalization were separated from the hash 
computation, and that same RDF canonicalization software were used for 
other purposes -- such as diff/patch -- then RDF users would be much 
more likely to notice if the software were adding, changing or deleting 
triples.

In short, I do not believe that we could ever have the same level of 
confidence in the security of a combined RDF canonicalize-and-hash 
algorithm as we could if the canonicalization and hashing were cleanly 
separated.  That is my fundamental concern about combining them.

Thanks,
David Booth

Received on Wednesday, 16 June 2021 22:20:52 UTC