Re: Chartering work has started for a Linked Data Signature Working Group @W3C from Manu Sporny on 2021-05-25 (semantic-web@w3.org from May 2021)

From: Manu Sporny <msporny@digitalbazaar.com>
Date: Tue, 25 May 2021 15:32:36 -0400
To: semantic-web@w3.org
Message-ID: <0ffa58c4-a1f2-1093-2e3e-0b77806bf3ac@digitalbazaar.com>

PFPS wrote:
> I would greatly appreciate a discussion of the possible flaws in that 
> document.  This discussion does not appear to be happening, which I find 
> worrisome.

I am attempting to engage in the discussion that you requested, Peter. I am
going to be pedantic in my response because you've made a number of technical
errors that caused you to come to the wrong conclusions.

At this point in time, it is clear that you either have not read the input
documents, or if you did, you missed a number of critical concepts in them
that caused you to create an incorrect mental model that then led to your
invalid conclusions.

My response is broken into high-level statements followed by a thorough
explanation. This email is very long because I want you to know that we're
taking your input seriously and spending a LOT of time to try and address your
concerns.

I'm thankful that you're engaging given that you are an expert in the RDF
space (which is one of the types of input we need for this work to succeed).

> I take the method to sign and verify RDF datasets to be as follows:

Your summary of the algorithms is incorrect, is not what is in the papers or
the specs, and leads to the problems you identified.

> To my non-expert eye there are several significant problems here. 1/ The 
> signature extracted from the signed document might be different from the 
> signature used to sign the original document if the original document has 
> signatures in it.

Wrong.

The LDP algorithms prevent this from happening.

If the signature extracted from the signed document is different in any way,
the signature will fail to verify.

This is expected behaviour.

> 2/ The dataset extracted during verification might not be the dataset used
>  during signing because the original document if the original document has
>  signatures in it.

Wrong.

The LDP algorithms prevent this from happening.

If the dataset changes, the signature will fail to verify.

This is expected behaviour.

> 3/ Adding extra information after signing might be possible without 
> affecting verification if the extra information looks like a signature.

Wrong.

The LDP algorithms prevent this from happening.

Adding extra information after signing changes the hash, which will cause the
signature to fail to verify.

This is expected behaviour.

> 4/ The dataset extracted during verification might not be the dataset used
>  during signing because the original document has relative IRIs.

Wrong.

Relative IRIs are resolved against the base IRI before they go into the
Canonicalization step. If the base IRI changes, the dataset changes and the
signature will fail to verify.

This is expected behaviour.

> 5/ The dataset extracted during verification might not be the dataset used
>  during signing because the original document is in a serialization that 
> uses external resources to generate the dataset (like @context in JSON-LD)
>  and this external resource may have changed.

Wrong.

If an external resource changes in a way that changes the dataset, then the
hash for the dataset will change, causing the signature to fail to verify.

This is expected behaviour.

> 6/ Only the serialized dataset is signed so changing comments in 
> serializations that allow comments or other parts of the document that do 
> not encode triples or quads results can be done without affecting the 
> validity of the signature.  This is particularly problematic for RDFa.

By definition, that is not the problem that the LDS WG is solving. We are
signing RDF Datasets; if you have information that lives outside of an RDF
Dataset that you need to sign, we can't help you.

All information that is signed is in the RDF Dataset. If there is information
outside of the RDF Dataset (like comments), then it will not be signed. This
is true for ANY digital signature mechanism. This only becomes a problem if an
application depends on information that is not signed, at which point the
application developer really should consider signing the unsigned information.

This is expected behaviour.

> I welcome discussion of these points and am open to being proven wrong on 
> them.

You are wrong to varying degrees on every point above. :)

I'm going to elaborate on why below... starting with your definition of the
algorithms at play.

> sign(document, private key, identity)

Wrong.

Your function signature is incorrect and does not match what's in the current
LDP specification:

https://w3c-ccg.github.io/ld-proofs/#proof-algorithm

The inputs you provide are inadequate when it comes to identifying key
material and protecting against replay attacks and domain retargeting attacks.

> let D be the RDF dataset serialized in document

Correct.

> let C be the canonicalized version of D

Correct.

> let S be triples representing a signature of C using private key

Wrong.

Not triples; quads. The proposed solution and algorithms are targeted at RDF
Datasets, not RDF Graphs. It is possible for some subset of the solution to
work on RDF Graphs, but the attack surface potentially gets larger and there
are more constraints that are required to make sure the data is being
processed correctly.

For example, if you try to apply the solution to RDF Graphs, nested signatures
in graph soup might become a headache (and this might be at the core of why
you think there is a problem).

The group will not be creating a solution for RDF Graphs in order to constrain
the focus of the correctness and security analysis.

> let signed document be document plus a serialization of S, so signed 
> document serializes D union (not merge) S

Wrong.

You skip right over a number of critical parts of the algorithm here (again,
your summary is wrong because you're eliminating security critical steps in
the c14n algorithm and Verify Hash Algorithm):

https://w3c-ccg.github.io/ld-proofs/#create-verify-hash-algorithm

For example, the RDF Dataset being signed is hashed *separately from* the RDF
signature options. That is, you have D /and/ S, which are separately hashed to
generate the signature, and then merged in the signed document. If you do not
separate these things correctly when you go to verify, your signature will
fail to verify. If you change signature options, your signature will fail to
verify. If you pollute your RDF Dataset with extra quads, your signature will
fail to verify. This is all expected behaviour and is important to the
security of the algorithm.
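
A minimal sketch of the separate hashing described above, assuming SHA-256
and a toy canonicalization (sorting N-Quads lines) standing in for the real
RDF Dataset Canonicalization algorithm. The function names and the order in
which the two hashes are joined are assumptions made for illustration; the
specification linked above is the authority.

```python
import hashlib


def toy_canonicalize(nquads):
    """Toy stand-in for RDF Dataset Canonicalization: sort the N-Quads lines.

    The real algorithm also assigns deterministic blank node labels.
    """
    return "".join(sorted(line.strip() + "\n" for line in nquads))


def create_verify_hash(dataset, proof_options):
    """Hash the proof options (S) and the dataset (D) separately, then join them."""
    options_hash = hashlib.sha256(toy_canonicalize(proof_options).encode("utf-8")).digest()
    dataset_hash = hashlib.sha256(toy_canonicalize(dataset).encode("utf-8")).digest()
    return options_hash + dataset_hash  # concatenation order is an assumption


dataset = ['<https://example.org/alice> <https://schema.org/name> "Alice" .']
proof_options = ['_:c14n0 <http://purl.org/dc/terms/created> "2021-05-25T19:32:36Z" .']
verify_hash = create_verify_hash(dataset, proof_options)
```

Because the two inputs are hashed separately, moving a quad from the options
into the dataset (or the reverse) yields a different value, which is why the
verifier has to separate them exactly as the signer did.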

> return signed document

Correct. :)

> verify(signed document)

The specification will probably end up being updated during the LDS WG to
include an `options` field as that's what many implementations do today.

> let D' be the RDF dataset serialized in signed document

Correct.

> let S be the signature in D'

Wrong.

S could be a single signature, a set of signatures, or a chain of signatures.

> let D be D' - S

Wrong.

Assuming you change S to be "all proofs", then yes... but if you do that, the
rest of your algorithm lacks sufficient detail to be correct.

> let C be the canonicalized version of D

Correct.

> return whether S is a valid signature for C

Wrong. You skip over many of the algorithms that work to secure the RDF Dataset.

The algorithms for verifying a single signature, a set of signatures, and a
chain of signatures matter here. Admittedly, the spec doesn't elaborate on
these as we've really only seen single and set signatures used in the wild.
Signature chains seemed like a good idea, but we haven't really seen those
advanced use cases in the wild and so the LDS WG may decide that we want to
avoid spending time on those things. There is also work being done on
cryptographic circuits where you can support M-of-N signatures, and other
types of multi-party signatures.  I expect that work to be outside of the
scope of the LDS WG as well.

Additionally, much of the work has been using JSON-LD as the RDF Dataset
serialization format, where it's easy to understand where you're entering the
graph and what subject a set of proofs is attached to. For things like
N-Quads, Turtle or other graph soup syntaxes, I expect that the algorithms
will need to be modified to specify the subject that the verifier is expecting
the proofs to be attached to (this will come into play later in the email).

> To my non-expert eye there are several significant problems here.

Wrong. There are many problems with the algorithms you provided, which are not
the algorithms in the specification.

> 1/ The signature extracted from the signed document might be different from
> the signature used to sign the original document if the original document
> has signatures in it.

Wrong.

The LDP algorithms prevent this from happening.

If the signature extracted from the signed document is different in any way,
the signature will fail to verify.

This is expected behaviour.

The algorithms that you use to verify a set of signatures and a chain of
signatures are different.

A set of signatures is expressed using the `proof` property.

A chain of signatures is expressed using the `proofChain` property.

It is not possible to mix both `proof` and `proofChain` in a single dataset
and get a deterministic ordering of signatures. The LDP specification will
probably, after LDS WG review, state that you MUST NOT do so... or we might
not support chained signatures at all.

Also keep in mind that the algorithm needs to understand which subject the
proof/proofChain properties are attached to. In JSON-LD, this is easy -- it's
whatever subject the top-level object describes. In Turtle or N-Quads, you have
to tell the algorithm which subject is associated with the proof/proofChain
properties. Keep in mind that we didn't specify this in the algorithms yet
because, again, this is something that the RDF WG needs to consider as it may
be possible to make this subject detection more automatic in Turtle or N-Quads.
This is a small, but important digression, and is probably a gap in your
knowledge about how all of this stuff is expected to work across multiple
serializations.

So, you're either dealing with one or more proofs associated with the `proof`
property, or you're dealing with one or more proofs associated with the
`proofChain` property.

For a set of signatures, the general algorithm is:

1. Remove `proof` (an unordered set) from the RDF Dataset
   that is associated with the given subject.
2. Iterate over each proof in any order and apply the
   Proof Verification Algorithm:
https://w3c-ccg.github.io/ld-proofs/#proof-verification-algorithm

The current algorithm in the specification doesn't state this because it's not
clear if the LDS WG is going to want to externalize this looping or
internalize it in the algorithm above.
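
The two steps for a set of signatures can be sketched roughly as follows.
This is an illustration under stated assumptions, not the specified
algorithm: the document is modelled as a plain dict, sorted-key JSON stands
in for canonicalization, HMAC stands in for a real digital signature, and
`toy_sign`, `add_proof` and `verify_proof_set` are hypothetical names.

```python
import hashlib
import hmac
import json


def toy_sign(document, options, key):
    """HMAC over the two separate hashes; a stand-in for a real signature."""
    def digest(obj):
        return hashlib.sha256(json.dumps(obj, sort_keys=True).encode("utf-8")).digest()
    return hmac.new(key, digest(options) + digest(document), hashlib.sha256).hexdigest()


def add_proof(signed, key_id, key):
    """Append one proof to the unordered `proof` set; every proof signs the same data."""
    document = {k: v for k, v in signed.items() if k != "proof"}
    options = {"verificationMethod": key_id}
    proof = dict(options, proofValue=toy_sign(document, options, key))
    return dict(signed, proof=signed.get("proof", []) + [proof])


def verify_proof_set(signed, keys):
    # Step 1: remove `proof` from the data attached to the subject.
    document = {k: v for k, v in signed.items() if k != "proof"}
    proofs = signed.get("proof", [])
    # Step 2: check each proof, in any order, against the same document.
    results = []
    for proof in proofs:
        options = {k: v for k, v in proof.items() if k != "proofValue"}
        expected = toy_sign(document, options, keys[options["verificationMethod"]])
        results.append(hmac.compare_digest(expected, proof["proofValue"]))
    return bool(proofs) and all(results)


keys = {"key-1": b"demo secret one", "key-2": b"demo secret two"}
doc = {"id": "https://example.org/alice", "name": "Alice"}
signed = add_proof(add_proof(doc, "key-1", keys["key-1"]), "key-2", keys["key-2"])
```

Here the loop is externalized in `verify_proof_set`; it could just as well
live inside the per-proof verification routine.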

For a chain of signatures, the general algorithm is:

1. Remove `proofChain` (an ordered list) from the RDF
   Dataset that is associated with the given subject.
2. Iterate over each proof in reverse order, adding
   all the proofs before it into the RDF Dataset and
   verifying against the last proof using the Proof Verification Algorithm:
https://w3c-ccg.github.io/ld-proofs/#proof-verification-algorithm

Again, we don't elaborate on this procedure because the vast majority of LDS
today just do single signatures and so it may be that we end up not defining
this in the specification.
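
Since the chain procedure is not spelled out in the specification, the
following is only one possible reading of the two steps above. It assumes
that each proof in the chain covers the document plus every proof before it,
models the document as a dict, uses sorted-key JSON in place of
canonicalization and HMAC in place of a real signature. All function names
are hypothetical.

```python
import hashlib
import hmac
import json


def toy_sign(payload, options, key):
    """HMAC over the two separate hashes; a stand-in for a real signature."""
    def digest(obj):
        return hashlib.sha256(json.dumps(obj, sort_keys=True).encode("utf-8")).digest()
    return hmac.new(key, digest(options) + digest(payload), hashlib.sha256).hexdigest()


def append_to_chain(signed, key_id, key):
    """Sign the document together with every proof already in the chain."""
    options = {"verificationMethod": key_id}
    proof = dict(options, proofValue=toy_sign(signed, options, key))
    return dict(signed, proofChain=signed.get("proofChain", []) + [proof])


def verify_proof_chain(signed, keys):
    # Step 1: remove `proofChain` (an ordered list) from the data.
    document = {k: v for k, v in signed.items() if k != "proofChain"}
    chain = signed.get("proofChain", [])
    # Step 2: walk the chain in reverse, re-adding the proofs that came before.
    for index in range(len(chain) - 1, -1, -1):
        earlier = chain[:index]
        payload = dict(document, proofChain=earlier) if earlier else document
        proof = chain[index]
        options = {k: v for k, v in proof.items() if k != "proofValue"}
        expected = toy_sign(payload, options, keys[options["verificationMethod"]])
        if not hmac.compare_digest(expected, proof["proofValue"]):
            return False
    return bool(chain)


keys = {"key-1": b"demo secret one", "key-2": b"demo secret two"}
doc = {"id": "https://example.org/alice", "name": "Alice"}
chained = append_to_chain(append_to_chain(doc, "key-1", keys["key-1"]), "key-2", keys["key-2"])
```

Unlike the set case, order matters here: swapping two entries in the chain
makes verification fail in this sketch.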

To be clear -- these algorithms are fairly straightforward (as they are just
variations on verifying a single digital signature) and their correctness
depends on the RDF Dataset Canonicalization algorithm and the use of well
known and vetted cryptographic hashing and digital signature algorithms. In
the very worst case, if the LDS WG doesn't feel comfortable supporting either
set or chained signatures, then the work could be constrained to a single
signature... and that is a topic of debate for the LDS WG.

> 2/ The dataset extracted during verification might not be the dataset used
>  during signing because the original document if the original document has
>  signatures in it.

Wrong.

The LDP algorithms prevent this from happening.

If the dataset changes, the signature will fail to verify.

This is expected behaviour.

As explained above, if the original dataset contained signatures, then those
signatures are canonicalized and signed.

The verification algorithm only removes the signatures from the RDF Dataset
that it is instructed to verify. That is, the proofs are bound to a particular
subject and it is those proofs that are removed and used during signature
verification using the general algorithms listed previously in this email
(and/or in the specification).

Each proof is contained in its own RDF Dataset, so there is no
cross-contamination between the proofs and the RDF Dataset containing the
non-proof data. That is, the algorithm can surgically remove the proofs that
are intended to be used during verification and leave other proofs that are
included in the canonicalized data alone. Doing so addresses the
recursion/embedding concern that both you and Dan raised.
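
As a rough illustration of that surgical removal, the sketch below models
quads as `(subject, predicate, object, graph)` tuples and assumes that each
proof lives in its own named graph, linked from its subject by a `proof`
property. The property IRI, the `ex:` names and the `split_proofs` helper are
assumptions made for this example.

```python
PROOF = "https://w3id.org/security#proof"  # assumed IRI for the `proof` property


def split_proofs(quads, subject):
    """Separate the proofs attached to `subject` from everything else.

    A graph value of None means the default graph. Proofs attached to other
    subjects are left in place, so they stay part of the signed data.
    """
    proof_graphs = {o for (s, p, o, g) in quads if s == subject and p == PROOF}
    remaining = [
        q for q in quads
        if not (q[0] == subject and q[1] == PROOF) and q[3] not in proof_graphs
    ]
    proofs = {name: [q for q in quads if q[3] == name] for name in proof_graphs}
    return remaining, proofs


quads = [
    ("ex:presentation", "ex:holds", "ex:credential", None),
    ("ex:presentation", PROOF, "_:g1", None),
    ("_:p1", "ex:proofValue", '"outer"', "_:g1"),
    ("ex:credential", PROOF, "_:g0", None),
    ("_:p0", "ex:proofValue", '"inner"', "_:g0"),
]
remaining, proofs = split_proofs(quads, "ex:presentation")
```

Only the outer proof graph (`_:g1`) is pulled out for verification; the inner
proof on `ex:credential` stays in `remaining` and is canonicalized and hashed
like any other data.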

> 3/ Adding extra information after signing might be possible without 
> affecting verification if the extra information looks like a signature.

Wrong.

The LDP algorithms prevent this from happening.

Adding extra information after signing changes the hash, which will cause the
signature to fail to verify.

This is expected behaviour.

The Linked Data Proofs algorithms hash and sign *every Quad*. This includes
the original RDF Dataset as well as all canonicalized options (i.e., signature
options minus the digital signature itself). This is detailed in the
specification here:

https://w3c-ccg.github.io/ld-proofs/#create-verify-hash-algorithm

This was a very deliberate design choice... other signature schemes, like
JWTs, allow unsigned data. LDP takes a stricter approach... you cannot
inject a Quad into either the original RDF Dataset OR the canonicalized
options and get the same hash (modulo a bona fide hash collision). In other
words, you cannot inject anything, anywhere that is covered by the signature
(which is everything)... especially "extra information that looks like a
signature" because that information is included in the signature.
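
The effect of injecting a Quad can be demonstrated with a small sketch. It
assumes SHA-256, sorted lines as a toy canonicalization, and HMAC as a
stand-in for a real digital signature; the key, the example quads and the
helper names are all hypothetical.

```python
import hashlib
import hmac


def verify_hash(dataset, options):
    """Hash the options and the dataset separately (sorted lines as toy c14n)."""
    def digest(quads):
        return hashlib.sha256("\n".join(sorted(quads)).encode("utf-8")).digest()
    return digest(options) + digest(dataset)


def toy_sign(dataset, options, key):
    """HMAC stands in for a real digital signature algorithm."""
    return hmac.new(key, verify_hash(dataset, options), hashlib.sha256).digest()


def toy_verify(dataset, options, signature, key):
    return hmac.compare_digest(toy_sign(dataset, options, key), signature)


key = b"hypothetical demo key"
dataset = ['<https://example.org/alice> <https://schema.org/name> "Alice" .']
options = ['_:p <http://purl.org/dc/terms/created> "2021-05-25T19:32:36Z" .']
signature = toy_sign(dataset, options, key)

# A quad that "looks like a signature", injected after signing.
injected = '_:x <https://w3id.org/security#proofValue> "looks-like-a-signature" .'
untouched_ok = toy_verify(dataset, options, signature, key)
dataset_injection_ok = toy_verify(dataset + [injected], options, signature, key)
options_injection_ok = toy_verify(dataset, options + [injected], signature, key)
```

In this sketch only the untouched input verifies; injecting the quad into
either the dataset or the options changes the hash and verification fails.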

> 4/ The dataset extracted during verification might not be the dataset used
>  during signing because the original document has relative IRIs.

Wrong.

Relative IRIs are resolved against the base IRI; if the base IRI changes, the
dataset changes and the signature will fail to verify.

This is expected behaviour.

Relative IRI resolution happens before canonicalization occurs. The JSON-LD
Playground (and underlying libraries) certainly do this as a part of JSON-LD
expansion:

https://www.w3.org/TR/json-ld11-api/#iri-expansion

RDF 1.1 Concepts states that "Relative IRIs must be resolved against a base
IRI to make them absolute. Therefore, the RDF graph serialized in such
syntaxes is well-defined only if a base IRI can be established [RFC3986]."
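
A small sketch of why a changed base IRI changes what gets hashed, using
Python's standard `urljoin` for RFC 3986 reference resolution. The base IRIs,
the relative IRI and the `quad_hash` helper are made up for this example, and
hashing a single quad stands in for canonicalizing and hashing a dataset.

```python
import hashlib
from urllib.parse import urljoin


def quad_hash(base_iri, relative_iri):
    """Resolve the relative IRI first, then hash the resulting quad."""
    subject = urljoin(base_iri, relative_iri)
    quad = f'<{subject}> <https://schema.org/name> "Alice" .\n'
    return subject, hashlib.sha256(quad.encode("utf-8")).hexdigest()


signed_subject, signed_hash = quad_hash("https://issuer.example/a/", "credentials/1")
moved_subject, moved_hash = quad_hash("https://issuer.example/b/", "credentials/1")
```

The same relative IRI resolves to two different absolute IRIs, so the two
hashes differ and a signature made under the first base would not verify
under the second.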

We could add language to LDP that states that either 1) all inputs must be
well-defined RDF Datasets, 2) all input IRIs MUST be absolute, 3)  any input
that contains a relative IRI and no base IRI as input is invalid (and do IRI
expansion in the canonicalization spec), or some other language that makes
this more clear.

Again, this is something that an LDS WG should debate and come to consensus on
given that the needs here are not just focused on JSON-LD and are not just
focused on Verifiable Credentials.

> 5/ The dataset extracted during verification might not be the dataset used
>  during signing because the original document is in a serialization that 
> uses external resources to generate the dataset (like @context in JSON-LD)
>  and this external resource may have changed.

Wrong; this is not a problem -- it's expected behaviour.

If an external resource changes in a way that changes the dataset, then the
hash for the dataset will change causing the signature to fail to verify.

This is expected behaviour.

For example, suppose you pull in a JSON-LD Context (J1) and use it to generate
Quads, canonicalize, and sign. If the context later changes to (J2) in a way
that changes terms or `@base` or anything else that modifies the IRIs that were
signed, then when the verifier converts the input to Quads, canonicalizes, and
checks the signature, the signature will be invalid, because the generated hash
changed due to the IRIs in the RDF Dataset changing.
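
The J1/J2 scenario can be sketched with a toy term expansion. This is not
JSON-LD processing: a context is modelled as a flat term-to-IRI dict, and the
two contexts, the document and the helper names are hypothetical.

```python
import hashlib


def toy_to_quads(document, context):
    """Toy term expansion: map each term to the IRI the context gives it."""
    subject = document["@id"]
    return sorted(
        f'<{subject}> <{context[term]}> "{value}" .'
        for term, value in document.items()
        if term != "@id"
    )


def dataset_hash(quads):
    return hashlib.sha256("\n".join(quads).encode("utf-8")).hexdigest()


document = {"@id": "https://example.org/alice", "name": "Alice"}
context_j1 = {"name": "https://schema.org/name"}
context_j2 = {"name": "http://xmlns.com/foaf/0.1/name"}  # the context after it changed

hash_at_signing = dataset_hash(toy_to_quads(document, context_j1))
hash_at_verification = dataset_hash(toy_to_quads(document, context_j2))
```

The document bytes are identical in both cases, but the Quads differ, so the
hash computed at verification no longer matches the one that was signed.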

> 6/ Only the serialized dataset is signed so changing comments in 
> serializations that allow comments or other parts of the document that do 
> not encode triples or quads results can be done without affecting the 
> validity of the signature.  This is particularly problematic for RDFa.

By definition, that is not the problem that the LDS WG is solving. We are
signing RDF Datasets; if you have information that lives outside of an RDF
Dataset that you need to sign, we can't help you.

All information that is signed is in the RDF Dataset. If there is information
outside of the RDF Dataset (like comments), then it will not be signed. This
is true for ANY digital signature mechanism. This only becomes a problem if an
application depends on information that is not signed, at which point the
application developer really should consider signing the unsigned information.

This is expected behaviour.
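
To make the comment case concrete, here is a sketch with a toy N-Quads
reader that drops blank lines and whole-line `#` comments before hashing. The
reader and the sample data are assumptions for illustration; a real parser
and the real canonicalization algorithm would take their place.

```python
import hashlib


def toy_parse(serialization):
    """Toy N-Quads reader: drop blank lines and whole-line '#' comments."""
    lines = (line.strip() for line in serialization.splitlines())
    return sorted(line for line in lines if line and not line.startswith("#"))


def dataset_hash(serialization):
    return hashlib.sha256("\n".join(toy_parse(serialization)).encode("utf-8")).hexdigest()


original = """# issued by the registrar
<https://example.org/alice> <https://schema.org/name> "Alice" .
"""
recommented = """# this comment was edited after signing
<https://example.org/alice> <https://schema.org/name> "Alice" .
"""
altered = """# issued by the registrar
<https://example.org/alice> <https://schema.org/name> "Mallory" .
"""
```

`original` and `recommented` yield the same dataset and therefore the same
hash, while `altered` changes a quad and yields a different one. The comment
was never part of the RDF Dataset, so it was never part of what was signed.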

This is not a problem for RDFa if the information you want to sign is the
underlying RDF Dataset. If you want to sign a blob of HTML that contains RDFa,
then you need to grab that blob of HTML and encapsulate it in the RDF Dataset
and digitally sign that... or you need to use a different digital signature
mechanism that just signs everything, including spaces, tabs, and other
unnecessary things that if they change, will break the signature.

Having the digital proof cover things outside of an RDF Dataset is almost
entirely out of scope. The only thing that is in scope is if you want to embed
the HTML as a literal, for example... and in that case, you can use an RDF
Dataset and LDP to do that.

----------------

I hope this explains how all of the problems you raised were either 1) not
problems, 2) previously known with mitigations in place, 3) solved with a few
sentences of documentation, or 4) not an issue and also out of scope of the
LDS WG.

I hope it's also clear that a large percentage of the questions you had
require RDF expertise to understand rather than "security expert" expertise.
While we have had input from both RDF experts and security experts, it's still
not clear what sort of expertise you're looking for when analysing these
algorithms. It's true that you need both sorts of people in the same room,
which is why we are forming an LDS WG *and* have entities like the IETF
Cryptography Forum Research Group, the National Institute of Standards and
Technology (currently engaged), and other "security experts" listed in the
Coordination section:

https://w3c.github.io/lds-wg-charter/#coordination

I hope these answers were helpful to you and I'm happy to answer other
relevant questions you may have.

What I would like from you in return are concrete suggestions on changes to
the specification, issues raised, or specific parties (by name or detailed
qualification) you feel should be a part of the discussion. Requesting that we
bring in "security experts" is not helpful... it's like asking if we've had
"RDF experts" sign-off on the algorithms. Just about every "real RDF expert" I
know would claim that they're not one... because they understand how broad and
deep that particular body of water is.

-- manu

-- 
Manu Sporny - https://www.linkedin.com/in/manusporny/
Founder/CEO - Digital Bazaar, Inc.
blog: Veres One Decentralized Identifier Blockchain Launches
https://tinyurl.com/veres-one-launches
Received on Tuesday, 25 May 2021 19:32:56 UTC