Re: Chartering work has started for a Linked Data Signature Working Group @W3C

[The referenced email appears to have a large amount of duplicated text. I 
have only responded to the part before the duplication starts.]

Here are several attacks that I believe can be carried out against the 
algorithms in https://w3c-ccg.github.io/ld-proofs/#algorithms.

Attack 1 is probably difficult to do and doesn't get much, but it does get 
consumers to believe that a producer signed something the producer didn't.  
The producer creates a file containing relative IRIs that serializes an RDF 
graph G when the base IRI is the retrieval IRI in the context where the 
consumer's verification algorithm will be run.  The producer then signs this 
file, using this retrieval IRI as the base IRI.  The consumer's verification 
function will succeed on the signed file.  But when the consumer actually 
deserializes the signed file the retrieval IRI may be different from the 
retrieval IRI that was used in the verification algorithm, so the consumer 
ends up with an RDF dataset that the producer never signed.
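
To make the mechanics concrete, here is a small Python sketch of the base-IRI 
mismatch.  The helper names and the HMAC used as a stand-in for a real 
signature are mine, not anything from the spec; it is only meant to show the 
shape of the attack.

    from urllib.parse import urljoin
    import hashlib, hmac

    KEY = b"producer key (HMAC stands in for a real signing key)"

    # The producer's file: one triple whose subject is a relative IRI.
    relative_triples = [("./doc", "http://example.org/p", '"o"')]

    def resolve(triples, base):
        # Resolve relative IRIs against a base IRI, as a deserializer would.
        return [(urljoin(base, s), p, o) for (s, p, o) in triples]

    def canonicalize(triples):
        # Toy canonicalization: sorted N-Triples-like lines.
        return "\n".join(sorted("%s %s %s ." % t for t in triples)).encode()

    def sign(triples):
        return hmac.new(KEY, canonicalize(triples), hashlib.sha256).hexdigest()

    def verify(triples, signature):
        return hmac.compare_digest(sign(triples), signature)

    # The producer signs with base IRI = the retrieval IRI that the
    # consumer's verification algorithm will use.
    verifier_base = "https://host-a.example/dir/"
    signature = sign(resolve(relative_triples, verifier_base))

    # Verification resolves against that retrieval IRI and succeeds.
    assert verify(resolve(relative_triples, verifier_base), signature)

    # But when the consumer actually uses the file, the retrieval IRI may
    # differ, so the dataset it believes was signed is not the one verified.
    consumer_base = "https://host-b.example/other/"
    assert resolve(relative_triples, consumer_base) != \
           resolve(relative_triples, verifier_base)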

Attack 2 is less difficult but requires something like the JSON-LD @context 
mechanism.  A producer signs a document that has a remote context that is 
under the control of a third party.  The consumer verifies the signed 
document, which succeeds because the first time the consumer asks for the 
remote context the same information as the producer used is sent, marked as 
expiring immediately.  The third party then sends a different remote context 
the next time the consumer asks for it, so that when the consumer 
deserializes the signed document the consumer sees an RDF dataset that is not 
what the producer signed.
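
A sketch of the moving-context problem, again in Python with toy stand-ins: 
the remote context server is simulated by two dicts, term expansion is a 
one-line lookup, and HMAC again stands in for a real signature.  None of 
this is the spec's machinery.

    import hashlib, hmac

    KEY = b"producer key"

    # Remote context under the third party's control: served once (marked as
    # expiring immediately) at verification time, then silently changed.
    context_v1 = {"name": "http://example.org/name"}
    context_v2 = {"name": "http://evil.example/alias"}

    document = {"@context": "https://third-party.example/ctx", "name": "Alice"}

    def to_quads(doc, remote_context):
        # Toy expansion: map each term through the (remote) context.
        return sorted(("_:s", remote_context[k], '"%s"' % v)
                      for k, v in doc.items() if k != "@context")

    def sign(quads):
        data = "\n".join("%s %s %s ." % q for q in quads).encode()
        return hmac.new(KEY, data, hashlib.sha256).hexdigest()

    # The producer signs the dataset obtained with context_v1.
    signature = sign(to_quads(document, context_v1))

    # The consumer's first fetch of the context returns context_v1, so
    # verification succeeds.
    assert sign(to_quads(document, context_v1)) == signature

    # The next fetch (when the consumer deserializes the document for use)
    # returns context_v2, so the consumer acts on a dataset that is not what
    # the producer signed.
    assert to_quads(document, context_v2) != to_quads(document, context_v1)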

Attack 3 depends on the presence of multiple proof nodes.  Suppose the 
original graph already contains a proof node.  The producer signs this graph.  
The consumer, as part of the verification process, removes all proof nodes and 
tries to verify the signed document minus the proof nodes.  The verification 
of the signature fails because the proof node in the original graph is not 
present.
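
A sketch of why a pre-existing proof node breaks verification, assuming (as 
the algorithm text suggests) a verifier that strips every proof node rather 
than only the one it is checking.  Quads are plain tuples and HMAC again 
stands in for a real signature.

    import hashlib, hmac

    KEY = b"producer key"
    PROOF = "https://w3id.org/security#proof"   # predicate linking to proof nodes

    def sign(quads):
        data = "\n".join(sorted("%s %s %s ." % q for q in quads)).encode()
        return hmac.new(KEY, data, hashlib.sha256).hexdigest()

    # The original graph already contains a proof node; the producer signs
    # the graph as it stands, so that proof node is covered by the signature.
    original = [
        ("ex:doc", "ex:claims", '"something"'),
        ("ex:doc", PROOF, "_:oldProof"),
    ]
    new_proof_value = sign(original)
    signed_document = original + [
        ("ex:doc", PROOF, "_:newProof"),
        ("_:newProof", "ex:proofValue", new_proof_value),
    ]

    # Verifier: remove *all* proof nodes, then recompute.
    def strip_proofs(quads):
        proof_nodes = {o for (s, p, o) in quads if p == PROOF}
        return [(s, p, o) for (s, p, o) in quads
                if p != PROOF and s not in proof_nodes]

    # Verification fails: the pre-existing proof node was part of what the
    # producer signed, but it has been stripped along with the new one.
    assert sign(strip_proofs(signed_document)) != new_proof_value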

Attack 4 also depends on the presence of multiple proof nodes and exploits a 
flaw in how the verify hash algorithm is specified.  Suppose an opponent 
creates a fake signature on an original graph.  The opponent then signs the 
"signed" graph.  The consumer then takes the proof nodes out of the graph that 
the opponent has signed.  The create verify hash algorithm is given two proof 
nodes but expects only one, and verifies only the opponent's signature.  The 
consumer then deserializes the graph it received and believes that the fake 
signature has been verified.  (A sketch covering this attack and attack 5 
follows the description of attack 5.)

Attack 5 is similar to attack 4 except that the false information is added 
afterwards.  Suppose a producer signs a linked data document.  Then an 
opponent adds an extra signature, either fake or real.  The consumer then 
takes the proof nodes out of the graph that the opponent has modified.  The 
create verify hash algorithm is given two proof nodes but expects only one, 
and verifies only the producer's signature.  The consumer then deserializes 
the graph it received and believes that the signature the opponent inserted 
has been verified.
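
Attacks 4 and 5 both turn on the same gap, sketched below: the document 
carries two proof nodes, the create verify hash algorithm is written as if 
it receives exactly one, and a verifier that just picks one of them can 
report success while the consumer believes the other proof was the one 
checked.  The selection rule and helper names are mine and are only meant to 
illustrate the ambiguity, not the spec.

    import hashlib, hmac

    PROOF_KEYS = {"producer": b"producer key", "opponent": b"opponent key"}

    data = [("ex:doc", "ex:claims", '"something"')]

    def proof_value(quads, key):
        c14n = "\n".join(sorted("%s %s %s ." % q for q in quads)).encode()
        return hmac.new(key, c14n, hashlib.sha256).hexdigest()

    # Attack 5 flavour: the producer signs, then the opponent attaches a
    # second proof node under its own key.  (Attack 4 is the same shape,
    # with the producer's proof replaced by a fabricated one.)
    proofs = {
        "_:producerProof": ("producer", proof_value(data, PROOF_KEYS["producer"])),
        "_:opponentProof": ("opponent", proof_value(data, PROOF_KEYS["opponent"])),
    }

    # Naive verifier: strip all proof nodes, then hand "the" proof to the
    # create verify hash step -- but there are two, and only one is checked.
    def naive_verify(quads, proofs):
        _node, (creator, value) = sorted(proofs.items())[-1]   # arbitrary pick
        return hmac.compare_digest(proof_value(quads, PROOF_KEYS[creator]), value)

    assert naive_verify(data, proofs)   # reports success...

    # ...but nothing says *which* proof was verified (here it happened to be
    # the producer's), so the consumer may believe the opponent's inserted
    # or fabricated proof has been checked.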


On Tue, 2021-05-25 at 15:32 -0400, Manu Sporny wrote:
 > PFPS wrote:
 > > I would greatly appreciate a discussion of the possible flaws in that
 > > document.  This discussion does not appear to be happening, which I find
 > > worrisome.
 >
 > I am attempting to engage in the discussion that you requested, Peter. I am
 > going to be pedantic in my response because you've made a number of technical
 > errors that caused you to come to the wrong conclusions.
 >
 > At this point in time, it is clear that you either have not read the input
 > documents, or if you did, you missed a number of critical concepts in them
 > that caused you to create an incorrect mental model that then led to your
 > invalid conclusions.
 >
 > My response is broken into high level statements and then thorough
 > explanation. This email is very long because I want you to know that we're
 > taking your input seriously and spending a LOT of time to try and
 > address your concerns.
 >
 > I'm thankful that you're engaging given that you are an expert in the RDF
 > space (which is one of the types of input we need for this work to succeed).
 >
 > > I take the method to sign and verify RDF datasets to be as follows:
 >
 > Your summary of the algorithms are incorrect, are not what are in the
 > papers
 > or the specs, and lead to the problems you identified.

See below.

 > > To my non-expert eye there are several significant problems here.
 > > 1/ The
 > > signature extracted from the signed document might be different
 > > from the
 > > signature used to sign the original document if the original
 > > document has
 > > signatures in it.
 >
 > Wrong.
 >
 > The LDP algorithms prevent this from happening.
 >
 > If the signature extracted from the signed document is different in any way,
 > the signature will fail to verify.
 >
 > This is expected behaviour.

See Attack 3.  Attack 4 is also relevant here.

 > > 2/ The dataset extracted during verification might not be the
 > > dataset used
 > >  during signing because the original document if the original
 > > document has
 > >  signatures in it.
 >
 > Wrong.
 >
 > The LDP algorithms prevent this from happening.
 >
 > If the dataset changes, the signature will fail to verify.
 >
 > This is expected behaviour.

See Attack 3.

 > > 3/ Adding extra information after signing might be possible without
 > > affecting verification if the extra information looks like a
 > > signature.
 >
 > Wrong.
 >
 > The LDP algorithms prevent this from happening.
 >
 > Adding extra information after signing changes the hash, which will
 > cause the
 > signature to fail to verify.
 >
 > This is expected behaviour.

See Attack 5.

 > > 4/ The dataset extracted during verification might not be the
 > > dataset used
 > >  during signing because the original document has relative IRIs.
 >
 > Wrong.

You seem to be saying here that relative IRIs don't cause verification to fail.

 > Relative IRIs are resolved against the base IRI before they go to
 > into the
 > Canonicalization step. If the base IRI changes, the dataset changes
 > and the
 > signature will fail to verify.

Here you appear to be saying that relative IRIs can cause verification to fail.

 > This is expected behaviour.

See Attack 1, which fiddles with relative IRIs in such a way that 
verification succeeds but the consumer believes a different RDF dataset has 
been verified.

 > > 5/ The dataset extracted during verification might not be the
 > > dataset used
 > >  during signing because the original document is in a serialization
 > > that
 > > uses external resources to generate the dataset (like @context in
 > > JSON-LD)
 > >  and this external resource may have changed.
 >
 > Wrong.

As above.

 > If an external resources changes in a way that changes the dataset,
 > then the
 > hash for the dataset will change causing the signature to fail to
 > verify.

As above.

 > This is expected behaviour.

See Attack 2, which fiddles with remote contexts in such a way that 
verification succeeds but the consumer believes a different RDF dataset has 
been verified.

 > > 6/ Only the serialized dataset is signed so changing comments in
 > > serializations that allow comments or other parts of the document
 > > that do
 > > not encode triples or quads results can be done without affecting
 > > the
 > > validity of the signature.  This is particularly problematic for
 > > RDFa.
 >
 > By definition, that is not the problem that the LDS WG is solving. We
 > are
 > signing RDF Datasets, if you have information that lives outside of
 > an RDF
 > Dataset that you need to sign, we can't help you.
 >
 > All information that is signed is in the RDF Dataset. If there is
 > information
 > outside of the RDF Dataset (like comments), then it will not be
 > signed. This
 > is true for ANY digital signature mechanism. This only becomes a
 > problem if an
 > application depends on information that is not signed, at which point
 > the
 > application developer really should consider signing the unsigned
 > information.
 >
 > This is expected behaviour.

That may be the *defined* behaviour, but there may be consumers who believe 
that the parts of the document that do not encode triples or quads have also 
been signed.

 > > I welcome discussion of these points and am open to being proven
 > > wrong on
 > > them.
 >
 > You are wrong to varying degrees on every point above. :)

I disagree.  I believe I have outlined attacks that exhibit each of these 
problems.  As all I have to go on is the high-level description in 
https://w3c-ccg.github.io/ld-proofs/#algorithms some of these attacks may not 
be exhibited in some implementations.  I am awaiting a reference 
implementation of the algorithms in 
https://w3c-ccg.github.io/ld-proofs/#algorithms.



 > I'm going to elaborate on why below... starting with your definition of the
 > algorithms at play.
 >
 > > sign(document, private key, identity)
 >
 > Wrong.
 >
 > Your function signature is incorrect and does not match what's in the
 > current
 > LDP specification:
 >
 > https://w3c-ccg.github.io/ld-proofs/#proof-algorithm
 >
 > The inputs you provide are inadequate when it comes to protecting
 > against
 > replay attacks, domain retargetting attacks, and identifying key
 > material.

Yes, I am missing the date.  The domain is optional.  The identity contains 
or points to a public/private key pair.

 > > let D be the RDF dataset serialized in document
 >
 > Correct.
 >
 > > let C be the canonicalized version of D
 >
 > Correct.
 >
 > > let S be triples representing a signature of C using private key
 >
 > Wrong.
 >
 > Not triples; quads. The proposed solution and algorithms are targeted
 > at RDF
 > Datasets, not RDF Graphs. It is possible for some subset of the
 > solution to
 > work on RDF Graphs, but the attack surface potentially gets larger
 > and there
 > are more constraints that are required to make sure the data is being
 > processed correctly.

The signature information is added to the default graph, as far as I can tell, 
so triples are adequate.

 > For example, if you try to apply the solution to RDF Graphs, nested signatures
 > in graph soup might become a headache (and this might be at the core of why
 > you think there is a problem).
 >
 > The group will not be creating a solution for RDF Graphs in order to constrain
 > the focus of the correctness and security analysis.

What does it matter if the serialized information is an RDF dataset or just an 
RDF graph?  An RDF graph is, in essence, an RDF dataset and the added 
information is added to the default graph.  So any problem with just RDF 
graphs is also present if RDF datasets are allowed.

 > > let signed document be document plus a serialization of S, so signed
 > > document serializes D union (not merge) S
 >
 > Wrong.
 >
 > You skip right over a number of critical parts of the algorithm here
 > (again,
 > your summary is wrong because you're eliminating security critical
 > steps in
 > the c14n algorithm and Verify Hash Algorithm):

I do skip over a part of the algorithm, but the result of steps 2 through 4 of 
https://w3c-ccg.github.io/ld-proofs/#proof-algorithm is a proof value (or 
signature for just signing) which is serialized and added to the document in 
step 5, so I think my summary is adequate.
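
For concreteness, the wrapped-up summary I am working from looks roughly like 
the following sketch.  This is my own condensation, not the spec's algorithm: 
a dataset is just a set of quads, canonicalization is sorted N-Quads, and an 
HMAC stands in for the real signature (and for the separate hashing of the 
proof options that the spec describes).

    import hashlib, hmac

    def canonicalize(dataset):
        return "\n".join(sorted("%s %s %s %s ." % q for q in dataset)).encode()

    def make_proof(canonical_bytes, key, identity, date):
        # Steps 2-4 of the proof algorithm wrapped into one action.
        value = hmac.new(key, canonical_bytes, hashlib.sha256).hexdigest()
        return frozenset({
            ("_:proof", "ex:creator", identity, "default"),
            ("_:proof", "ex:created", date, "default"),
            ("_:proof", "ex:proofValue", value, "default"),
        })

    def sign(dataset, key, identity, date):
        C = canonicalize(dataset)                 # canonicalize D
        S = make_proof(C, key, identity, date)    # the proof value and its quads
        return dataset | S                        # step 5: D union S

    signed = sign(frozenset({("ex:doc", "ex:claims", '"something"', "default")}),
                  b"key", "ex:me", "2021-05-26")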

 > https://w3c-ccg.github.io/ld-proofs/#create-verify-hash-algorithm
 >
 > For example, the RDF Dataset being signed is hashed *separately from*
 > the RDF
 > signature options. That is, you have D /and/ S, which are separately
 > hashed to
 > generate the signature, and then merged in the signed document. If
 > you do not
 > separate these things correctly when you go to verify, your signature
 > will
 > fail to verify. If you change signature options, your signature will
 > fail to
 > verify. If you pollute your RDF Dataset with extra quads, your
 > signature will
 > fail to verify. This is all expected behaviour and is important to
 > the
 > security of the algorithm.

Agreed, but my summary just wraps all that up into a single action.

 > > return signed document
 >
 > Correct. :)
 >
 > > verify(signed document)
 >
 > The specification will probably end up being updated during the LDS WG to
 > include an `options` field as that's what many implementations do today.
 >
 > > let D' be the RDF dataset serialized in signed document
 >
 > Correct.
 >
 > > let S be the signature in D'
 >
 > Wrong.
 >
 > S could be a single signature, a set of signatures, or a chain of
 > signatures.

Step 3 of the verification algorithm extracts all the proof nodes, but the 
result is then fed into 
https://w3c-ccg.github.io/ld-proofs/#create-verify-hash-algorithm, which 
appears to accept a single proof.  There are other places where the proof 
value also appears to be a single proof.  In any case, the proof nodes are 
all removed.

 > > let D be D' - S
 >
 > Wrong.
 >
 > Assuming you change S to be "all proofs", then yes... but if you do
 > that, the
 > rest of your algorithm lacks sufficient detail to be correct.

OK, S is all proof nodes in D'.

 > > let C be the canonicalized version of D
 >
 > Correct.
 >
 > > return whether S is a valid signature for C
 >
 > Wrong. You skip over many of the algorithms that work to secure the
 > RDF Dataset.

I do skip over the details of determining whether S is valid or not, but I 
don't think my summary is incorrect.  I do believe that my comments above on 
multiplicity of signatures are correct.

 > The algorithms for verifying a single signature, a set of signatures, and a
 > chain of signatures matter here. Admittedly, the spec doesn't elaborate on
 > these as we've really only seen single and set signatures used in the wild.
 > Signature chains seemed like a good idea, but we haven't really seen those
 > advanced use cases in the wild and so the LDS WG may decide that we want to
 > avoid spending time on those things. There is also work being done on
 > cryptographic circuits where you can support M-of-N signatures, and other
 > types of multi-party signatures.  I expect that work to be outside of the
 > scope of the LDS WG as well.

 > Additionally, much of the work has been using JSON-LD as the RDF Dataset
 > serialization format, where it's easy to understand where you're entering the
 > graph and what subject a set of proofs is attached to. For things like
 > N-Quads, TURTLE or other graph soup syntaxes, I expect that the algorithms
 > will need to be modified to specify the subject that the verifier is expecting
 > the proofs to be attached to (this will come into play later in the email).

Is this so?  How does one determine whether one signature is included in 
another signing in JSON-LD?


[Start of duplicated content.]

 > > To my non-expert eye there are several significant problems here.
 >
 > Wrong. There are many problems with the algorithms you provided,
 > which are not
 > the algorithms in the specification.
 >
 > > 1/ The signature extracted from the signed document might be
 > > different from
 > > the signature used to sign the original document if the original
 > > document
 > > has signatures in it.
 >
 > Wrong.
 >
 > The LDP algorithms prevent this from happening.
 >
 > If the signature extracted from the signed document is different in
 > any way,
 > the signature will fail to verify.
 >
 > This is expected behaviour.
 >
 > The algorithms that you use to verify a set of signatures and a chain
 > of
 > signatures are different.
 >
 > A set of signatures is expressed using the `proof` property.
 >
 > A chain of signatures is expressed using the `proofChain` property.
 >
 > It is not possible to mix both `proof` and `proofChain` in a single
 > dataset
 > and get a deterministic ordering of signatures. The LDP specification
 > will
 > probably, after LDS WG review, state that you MUST NOT do so... or we
 > might
 > not support chained signatures at all.
 >
 > Also keep in mind that the algorithm needs to understand which
 > subject the
 > proof/proofChain properties are attached to. In JSON-LD, this is easy
 > -- it's
 > whatever subject the top level object describes. In TURTLE or NQuads,
 > you have
 > to tell the algorithm which subject is associated with the
 > proof/proofChain
 > properties. Keep in mind that we didn't specify this in the
 > algorithms yet
 > because, again, this is something that the RDF WG needs to consider
 > as it may
 > be possible to make this subject detection more automatic in TURTLE
 > or NQuads.
 > This is a small, but important digression, and is probably a gap in
 > your
 > knowledge about how all of this stuff is expected to work across
 > multiple
 > serializations.
 >
 > So, you're either dealing with one or more proofs associated with the
 > `proof`
 > property, or you're dealing with one or more proofs associated with
 > the
 > `proofChain` property.
 >
 > For a set of signatures, the general algorithm is:
 >
 > 1. Remove `proof` (an unordered set) from the RDF Dataset
 >    that is associated with the given subject.
 > 2. Iterate over each proof in any order and apply the
 >    Proof Verification Algorithm:
 > https://w3c-ccg.github.io/ld-proofs/#proof-verification-algorithm
 >
 > The current algorithm in the specification doesn't state this because
 > it's not
 > clear if the LDS WG is going to want to externalize this looping or
 > internalize it in the algorithm above.
 >
 > For a chain of signatures, the general algorithm is:
 >
 > 1. Remove `proofChain` (an ordered list) from the RDF
 >    Dataset that is associated with the given subject.
 > 2. Iterate over each proof in reverse order, adding
 >    the all proofs before it into the RDF Dataset and
 >    verifying against the last proof using the Proof Verification
 > Algorithm:
 > https://w3c-ccg.github.io/ld-proofs/#proof-verification-algorithm
 >
 > Again, we don't elaborate on this procedure because the vast majority
 > of LDS
 > today just do single signatures and so it may be that we end up not
 > defining
 > this in the specification.
 >
 > To be clear -- these algorithms are fairly straight forward (as they
 > are just
 > variations on verifying a single digital signature) and their
 > correctness
 > depends on the RDF Dataset Canonicalization algorithm and the use of
 > well
 > known and vetted cryptographic hashing and digital signature
 > algorithms. In
 > the very worst case, if the LDS WG doesn't feel comfortable
 > supporting either
 > set or chained signatures, then the work could be constrained to a
 > single
 > signature... and that is a topic of debate for the LDS WG.
 >
 > > 2/ The dataset extracted during verification might not be the
 > > dataset used
 > >  during signing because the original document if the original
 > > document has
 > >  signatures in it.
 >
 > Wrong.
 >
 > The LDP algorithms prevent this from happening.
 >
 > If the dataset changes, the signature will fail to verify.
 >
 > This is expected behaviour.
 >
 > As explained above, if the original dataset contained signatures,
 > then those
 > signatures are canonicalized and signed.
 >
 > The verification algorithm only removes the signatures from the RDF
 > Dataset
 > that it is instructed to verify. That is, the proofs are bound to a
 > particular
 > subject and it is those proofs that are removed and used during
 > signature
 > verification using the general algorithms listed previously in this
 > email
 > (and/or in the specification).
 >
 > Each proof is contained in its own RDF Dataset, so there is no
 > cross-contamination between the proofs and the RDF Dataset containing
 > the
 > non-proof data. That is, the algorithm can surgically remove the
 > proofs that
 > are intended to be used during verification and leave other proofs
 > that are
 > included in the canonicalized data alone. Doing so addresses the
 > recursion/embedding concern that both you and Dan raised.
 >
 > > 3/ Adding extra information after signing might be possible without
 > > affecting verification if the extra information looks like a
 > > signature.
 >
 > Wrong.
 >
 > The LDP algorithms prevent this from happening.
 >
 > Adding extra information after signing changes the hash, which will
 > cause the
 > signature to fail to verify.
 >
 > This is expected behaviour.
 >
 > The Linked Data Proofs algorithms hash and sign *every Quad*. This
 > includes
 > the original RDF Dataset as well as all canonicalized options (i.e.,
 > signature
 > options minus the digital signature itself). This is detailed in the
 > specification here:
 >
 > https://w3c-ccg.github.io/ld-proofs/#create-verify-hash-algorithm
 >
 > This was a very deliberate design choice... other signature schemes,
 > like
 > JWTs, allow unsigned data. LDP takes a more strict approach... you
 > cannot
 > inject a Quad into either the original RDF Dataset OR the
 > canonicalized
 > options and get the same hash (modulo a bonafide hash collision). In
 > other
 > words, you cannot inject anything, anywhere that is covered by the
 > signature
 > (which is everything)... especially "extra information that looks
 > like a
 > signature" because that information is included in the signature.
 >
 > > 4/ The dataset extracted during verification might not be the
 > > dataset used
 > >  during signing because the original document has relative IRIs.
 >
 > Wrong.
 >
 > Relative IRIs are resolved against the base IRI, if the base IRI
 > changes, the
 > dataset changes and the signature will fail to verify.
 >
 > This is expected behaviour.
 >
 > Relative IRI resolution happens before canonicalization occurs. The
 > JSON-LD
 > Playground (and underlying libraries) certainly do this as a part of
 > JSON-LD
 > expansion:
 >
 > https://www.w3.org/TR/json-ld11-api/#iri-expansion
 >
 > RDF 1.1 Concepts states that "Relative IRIs must be resolved against
 > a base
 > IRI to make them absolute. Therefore, the RDF graph serialized in
 > such
 > syntaxes is well-defined only if a base IRI can be established
 > [RFC3986]."
 >
 > We could add language to LDP that states that either 1) all inputs
 > must be
 > well-defined RDF Datasets, 2) all input IRIs MUST be absolute, 3)
 > any input
 > that contains a relative IRI and no base IRI as input is invalid (and
 > do IRI
 > expansion in the canonicalization spec), or some other language that
 > makes
 > this more clear.
 >
 > Again, this is something that an LDS WG should debate and come to
 > consensus on
 > given that the needs here are not just focused on JSON-LD and are not
 > just
 > focused on Verifiable Credentials.
 >
 > > 5/ The dataset extracted during verification might not be the
 > > dataset used
 > >  during signing because the original document is in a serialization
 > > that
 > > uses external resources to generate the dataset (like @context in
 > > JSON-LD)
 > >  and this external resource may have changed.
 >
 > Wrong; this is not a problem -- it's expected behaviour.
 >
 > If an external resource changes in a way that changes the dataset,
 > then the
 > hash for the dataset will change causing the signature to fail to
 > verify.
 >
 > This is expected behaviour.
 >
 > For example, if you pull in a JSON-LD Context (J1) and use it to
 > generate Quads,
 > canonicalize, and sign... and then the context changes to (J2) that
 > changes
 > terms or `@base` or anything else that modifies the IRIs that were
 > signed,
 > when the verifier converts the input to Quads, canonicalizes and
 > checks the
 > signature, the signature will be invalid, because the generated hash
 > changed
 > due to the IRIs in the RDF Dataset changing.
 >
 > > 6/ Only the serialized dataset is signed so changing comments in
 > > serializations that allow comments or other parts of the document
 > > that do
 > > not encode triples or quads results can be done without affecting
 > > the
 > > validity of the signature.  This is particularly problematic for
 > > RDFa.
 >
 > By definition, that is not the problem that the LDS WG is solving. We
 > are
 > signing RDF Datasets, if you have information that lives outside of
 > an RDF
 > Dataset that you need to sign, we can't help you.
 >
 > All information that is signed is in the RDF Dataset. If there is
 > information
 > outside of the RDF Dataset (like comments), then it will not be
 > signed. This
 > is true for ANY digital signature mechanism. This only becomes a
 > problem if an
 > application depends on information that is not signed, at which point
 > the
 > application developer really should consider signing the unsigned
 > information.
 >
 > This is expected behaviour.
 >
 > This is not a problem for RDFa if the information you want to sign is
 > the
 > underlying RDF Dataset. If you want to sign a blob of HTML that
 > contains RDFa,
 > then you need to grab that blob of HTML and encapsulate it in the RDF
 > Dataset
 > and digitally sign that... or you need to use a different digital
 > signature
 > mechanism that just signs everything, including spaces, tabs, and
 > other
 > unnecessary things that if they change, will break the signature.
 >
 > Having the digital proof cover things outside of an RDF Dataset is
 > almost
 > entirely out of scope. The only thing that is in scope is if you want
 > to embed
 > the HTML as a literal, for example... and in that case, you can use
 > an RDF
 > Dataset and LDP to do that.
 >
 > ----------------
 >
 > I hope this explains how all of the problems you raised were either
 > 1) not
 > problems, 2) previously known with mitigations in place 3) solved
 > with a few
 > sentences of documentation, or 4) not an issue and also out of scope
 > of the
 > LDS WG.
 >
 > I hope it's also clear that a large percentage of the questions you
 > had
 > require RDF expertise to understand rather than "security expert"
 > expertise.
 > While we have had input from both RDF experts and security experts,
 > it's still
 > not clear what sort of expertise you're looking to when analysing
 > these
 > algorithms. It's true that you need both sorts of people in the same
 > room, and
 > is thus why we are forming an LDS WG *and* have entities like the
 > IETF
 > Cryptography Forum Research Group, the National Institute of
 > Standards
 > (currently engaged), and other "security experts" listed in the
 > Coordination
 > section:
 >
 > https://w3c.github.io/lds-wg-charter/#coordination
 >
 > I hope these answers were helpful to you and I'm happy to answer
 > other
 > relevant questions you may have.
 >
 > What I would like from you in return are concrete suggestions on
 > changes to
 > the specification, issues raised, or specific parties (by name or
 > detailed
 > qualification) you feel should be a part of the discussion.
 > Requesting that we
 > bring in "security experts" is not helpful... it's like asking if
 > we've had
 > "RDF experts" sign-off on the algorithms. Just about every "real RDF
 > expert" I
 > know would claim that they're not one... because they understand how
 > broad and
 > deep that particular body of water is.
 >
 > -- manu
 >

Received on Wednesday, 26 May 2021 14:09:54 UTC