Re: VC-JWT perma-thread (was: Re: RDF Dataset Canonicalization - Formal Proof) from David Waite on 2021-03-31 (public-credentials@w3.org from March 2021)

From: David Waite <dwaite@pingidentity.com>
Date: Tue, 30 Mar 2021 23:08:18 -0600
To: Dave Longley <dlongley@digitalbazaar.com>
Cc: Orie Steele <orie@transmute.industries>, Manu Sporny <msporny@digitalbazaar.com>, Credentials Community Group <public-credentials@w3.org>
Message-ID: <CA+3kW=aXYKV-zbJta2jGCwoNmFhXgpE8vgviOOU1h__Oqn6efQ@mail.gmail.com>
Yes, you must restrict JSON-LD to further resemble a static JSON format if
you use the received data as JSON.

The two most obvious restrictions are to raise an error on unrecognized
properties and to raise an error if the @context has custom values (you
likely want to just ban anything other than references to pre-cached
documents). Implementors must adopt all of these restrictions, or they will
likely end up with a system that falls to an existing playbook of exploits.
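
As a rough sketch of those two checks (the allowlist, the term set and the
function name are all hypothetical, and only top-level properties are
inspected; a real verifier would need to recurse into nested objects):

```python
# Hypothetical allowlist: context URLs the verifier has pre-cached, and the
# terms those contexts are assumed to define.
ALLOWED_CONTEXTS = {"https://www.w3.org/2018/credentials/v1"}
KNOWN_TERMS = {"@context", "id", "type", "issuer", "issuanceDate",
               "credentialSubject", "proof"}


class StrictnessError(ValueError):
    """Raised when a document strays from the static profile."""


def check_strict(doc: dict) -> None:
    contexts = doc.get("@context")
    if isinstance(contexts, str):
        contexts = [contexts]
    if not isinstance(contexts, list) or not contexts:
        raise StrictnessError("@context must be a non-empty list of references")
    for entry in contexts:
        # Inline context objects (including @vocab) are rejected outright.
        if not isinstance(entry, str) or entry not in ALLOWED_CONTEXTS:
            raise StrictnessError(f"disallowed @context entry: {entry!r}")
    # Any top-level property outside the known term set is an error,
    # rather than something to drop silently.
    unknown = sorted(set(doc) - KNOWN_TERMS)
    if unknown:
        raise StrictnessError(f"unrecognized properties: {unknown}")
```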

Alternatively, you could operate solely on the canonicalized data, either:
1. Write your tools to operate on the RDF
2. Discard the received JSON, convert the RDF back into JSON-LD using a
static context, and do JSON-based processing on _that_

I personally try to avoid plaintext signed data because of the number of
instances I've seen of RDD (Regex-Driven Development). The easiest way to
validate an LD-Proof is to discard the proof :-)

-DW


On Tue, Mar 30, 2021 at 10:43 AM Dave Longley <dlongley@digitalbazaar.com>
wrote:

>
> On 3/30/21 11:43 AM, Orie Steele wrote:
> > Overall I agree with a lot of David's comments.
> >
> > In particular, I have seen the following issues with LD Proofs:
> >
> > 1. silently dropping terms instead of throwing an error (allows an
> > attacker to inject terms that are then dropped).
> > 2. poor implementations loading contexts over the network (DNS
> > poisoning, latency attacks)
> > 3. @vocab and other language "features" making it hard to tell what you
> > are actually signing
> > 4. documentation / controllership issues with vocab (same problem as
> > JOSE, things need to be registered and documented somewhere)
> >
> > 3) is easy to fix, @vocab should result in an error being thrown in any
> > security context. https://github.com/w3c/vc-data-model/issues/753
> >
> > Note that 3 applies to all VC formats, regardless of the proof /
> > signature format.
> >
> > 2) is very easy to fix, just pass a document loader that never makes
> > network requests to any software you want to never make network requests
> > and make sure the software still passes all its tests...
> >
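
A minimal sketch of such a loader, assuming a hypothetical in-memory cache
of vetted contexts. The example URL, the cache and the return shape are
assumptions for illustration; adapt the shape to whatever your JSON-LD
library expects from a document loader.

```python
# Hypothetical pre-cached contexts, keyed by URL. A real deployment would
# load vetted copies from disk rather than define them inline.
CACHED_CONTEXTS = {
    "https://example.org/contexts/v1": {
        "@context": {"name": "https://schema.org/name"}
    },
}


def offline_document_loader(url: str, options=None) -> dict:
    """Resolve a context from the local cache only; never touch the network."""
    try:
        document = CACHED_CONTEXTS[url]
    except KeyError:
        # Failing loudly is the point: an uncached URL is an error,
        # not a reason to go and fetch it.
        raise LookupError(f"refusing to fetch uncached context: {url}") from None
    return {"contextUrl": None, "documentUrl": url, "document": document}
```
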
> > 1.) is the most critical imo, different implementations handle this
> > issue differently.
> >
> > IMO the correct behavior is to throw when ANY undefined term is
> > detected, and halt immediately. Implementations that silently dropped
> > properties have created a massive security issue for us on this front...
> > and it's related to canonicalization, essentially if your
> > canonicalization alg silently drops any information it's a security
> > vulnerability... the default behavior of any such algorithm should be to
> > throw.
>
> +1, I agree and think we can address the issue by being strict in this
> manner. If you pass in some JSON-LD (or other LD format) to a
> sign/verify API and any terms are not defined, you'll get an error. This
> creates the security binding/boundaries that we want whilst still
> allowing us to enjoy benefits we get from canonicalization.
>
> >
> > There is a kind of pseudo canonicalization that every digital signature
> > system relies on... and it's called a hash function. There are a number
> > of reasons that hash functions are used with digital signatures, and a
> > number of attacks that have resulted from poor choice of hash functions:
> >
> > - https://blog.torproject.org/md5-certificate-collision-attack-and-what-it-means-tor
> > - https://www.zdnet.com/article/sha-1-collision-attacks-are-now-actually-practical-and-a-looming-danger/
> >
> > Yes, there are problems with complexity in the data that is hashed
> > before a signature is applied, but none as deadly as picking a poor hash
> > function.
> >
> > in JOSE, what is signed is "base64url(json(header)).base64url(json(payload))"
> >
> > in LD Proofs, what is signed is
> > "sha256(canonicalize(header))sha256(canonicalize(document)) "
> >
> > See https://docs.joinmastodon.org/spec/security for another
> > explanation...
> >
> > In both cases, the signature algorithm likely hashes this message before
> > signing with EdDSA or ECDSA, etc...
> >
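
The two signing inputs described above can be sketched as follows. This is
an illustration only: the function names are hypothetical, the LD Proof
side assumes its inputs were already canonicalized by some algorithm not
shown here, and no actual signing is performed. The JWS side uses
unpadded base64url as RFC 7515 specifies.

```python
import base64
import hashlib
import json


def b64url(data: bytes) -> bytes:
    # JWS uses base64url without trailing padding (RFC 7515).
    return base64.urlsafe_b64encode(data).rstrip(b"=")


def jws_signing_input(header: dict, payload: dict) -> bytes:
    # The exact bytes of each JSON serialization are what gets signed;
    # no canonicalization of the JSON takes place.
    return (b64url(json.dumps(header).encode("utf-8")) + b"." +
            b64url(json.dumps(payload).encode("utf-8")))


def ld_proof_signing_input(canonical_options: str, canonical_document: str) -> bytes:
    # Inputs are assumed to be canonical serializations already (for
    # example N-Quads output); the two SHA-256 digests are concatenated.
    return (hashlib.sha256(canonical_options.encode("utf-8")).digest() +
            hashlib.sha256(canonical_document.encode("utf-8")).digest())
```
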
> > A couple observations....
> >
> > base64 in jose is a form of canonicalizing... because header and payload
> > objects might have different orderings, but base64url encoding makes
> > those orderings opaque... by inflating them 33%.
> >
> > canonicalize in the LD Proof could be JCS, or simple sorting of JSON
> > Keys... or RDF Data Set Normalization... each would yield a different
> > signature...
> >
> > mechanically, the fact that JCS exists hints at the problem with JOSE...
> > if you want to sign things, you want stable hashes, and therefore
> > need SOME form of canonicalization for complex data structures.
> >
> > JOSE works very well for small id tokens, like the ones that are used in
> > OIDC / OAuth... JOSE totally doesn't scale to signatures over large data
> > sets without another tool.
> >
> > "Detached JWS with Unencoded Payload":
> >
> > https://tools.ietf.org/html/rfc7515#appendix-F
> > https://tools.ietf.org/html/rfc7797
> >
> > This is how the JWS for LD Proofs are generated, and the "Unencoded
> > payload part" is the result of the canonicalization algorithm....
> >
> > What would happen if we just decided to use "Unencoded Payload" without
> > canonicalization?... maybe we just use JSON.stringify?
> >
> > it still works!... sorta... now I can generate a new message and
> > signature for every ordering of data in the payload... for a really
> > complex and very large payload, that's going to be a LOT of deeply equal
> > objects... that each yield a different signature... this can lead to
> > storing a massive amount of redundant but indistinguishable data...
> > which can lead to resource exhaustion attacks.
> >
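
To make the ordering problem concrete, here is a small sketch. The
`canonical` helper is a simplification (sorted keys, fixed separators) and
is not a full JCS implementation; RFC 8785 also has rules for number
serialization and key ordering that are not covered here.

```python
import hashlib
import json

a = {"name": "Alice", "age": 30}
b = {"age": 30, "name": "Alice"}  # deeply equal to a, different key order


def digest(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def canonical(obj) -> str:
    # Sorted keys and fixed separators; a stand-in for real canonicalization.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)


# Plain serialization keeps insertion order, so the hashes differ even
# though the two objects are equal.
naive_differs = digest(json.dumps(a)) != digest(json.dumps(b))

# After canonicalization both objects hash to the same value.
canonical_matches = digest(canonical(a)) == digest(canonical(b))
```
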
> > In fact, the sidetree protocol uses JCS for this exact
> > reason... https://identity.foundation/sidetree/spec/#default-parameters
> >
> > So in summary, in any JOSE library you can replace JSON with JCS and get
> > better signatures, and developers will thank you because they won't be
> > tracking down bugs related to duplicate content... and canonicalization
> > can also lead to security issues if not handled properly... regardless
> > of how you canonicalize things.
> >
> > Regards,
> >
> > OS
> >
> >
> >
> > On Tue, Mar 30, 2021 at 1:47 AM David Waite <dwaite@pingidentity.com
> > <mailto:dwaite@pingidentity.com>> wrote:
> >
> >     On 3/27/21 11:12 AM, David Chadwick wrote:
> >     > This is a major benefit of using JWS/JWT, as canonicalisation has been
> >     > fraught with difficulties (as anybody who has worked with XML signatures
> >     > will know, and discussions in the IETF PKIX group have highlighted).
> >
> >     On Mar 27, 2021, 9:26 AM, Manu Sporny wrote:
> >
> >         Anyone who believes that RDF Dataset Canonicalization is the same
> >         problem as XML Canonicalization does not understand the problem
> >         space. These are two very different problem spaces with very
> >         different solutions.
> >
> >
> >     There have been interoperability issues with XML canonicalization,
> >     but the impact of those _pales_ in comparison to the security issues.
> >     JOSE was adopted as a next step for signed data for many use cases
> >     both for interoperability and for security reasons.
> >
> >     It is crucially important to remember that for current LD proofs:
> >     - the canonicalization algorithm determines which details are
> >     critical and which are ignorable
> >     - the proof algorithms specify a canonicalization algorithm; there
> >     is no guarantee that URDNA2015 will always be the one chosen
> >     - JSON-LD is not just for serialization of RDF, but for the
> >     interpretation of JSON as RDF.
> >
> >     You need security considerations for processing a JSON-encoded
> >     document following a successful LD Proof. This is because you did
> >     not prove the JSON was integrity-protected, but that the RDF
> >     interpretation of the JSON by some canonicalization algorithm
> >     (itself an interpretation based on some JSON-LD context) was
> >     protected.
> >
> >     And these were the problems with XML Signatures and XML
> >     Canonicalization. Developers want clean abstractions, and _need_
> >     clean abstractions for security boundaries. Canonicalization and
> >     document transformations mean a developer must process the data in
> >     the same way as the security layer, lest you have potential security
> >     vulnerabilities.
> >
> >     I imagine that there will eventually be a desire to
> >     separately sign different subsets of the RDF dataset for large
> >     datasets (like graph databases), or to support the proof being
> >     external to the dataset rather than being represented as part of the
> >     dataset, and so on. These complexities in XML canonicalization and
> >     signatures introduced security vulnerabilities. Even with
> >     correct signature library implementations, the application code
> >     interpreting the data did not necessarily rise to the same level of
> >     sophistication.
> >
> >     JOSE for this reason chose a 'sealed envelope' approach to signing
> >     and encryption, where the data is opaque to the security layer and
> >     vice-versa. The abstraction isn't in some canonical interpretation
> >     of the application data, but that the data is byte-for-byte
> >     identical to what was signed.
> >
> >     This is why JSON Clear Signatures had so little interest from the
> >     JOSE community at large. The problem wasn't that we couldn't imagine
> >     a canonicalization of JSON, it was that so many had been burned by
> >     all the edge cases that grew out of that flexibility in the past.
> >     For those who cared about saving 25%+ of their data cost by wrapping
> >     (potentially) binary data in a text-safe format, CBOR/COSE became
> >     available.
> >
> >     -DW
> >
> >     P.S. this is completely ignoring the issues of DNS-style 'poisoning'
> >     if you accept data from non-authoritative sources based purely on it
> >     being signed, then treat that data as part of a cache or as an
> >     update to your own persistent data set. This was an uncommon problem
> >     in XML since most XML-based formats did not support embedding
> >     external resources.
> >
> >     /CONFIDENTIALITY NOTICE: This email may contain confidential and
> >     privileged material for the sole use of the intended recipient(s).
> >     Any review, use, distribution or disclosure by others is strictly
> >     prohibited.  If you have received this communication in error,
> >     please notify the sender immediately by e-mail and delete the
> >     message and any file attachments from your computer. Thank you./
> >
> >
> >
> > --
> > *ORIE STEELE*
> > Chief Technical Officer
> > www.transmute.industries
> >
> > <https://www.transmute.industries>
>
>
> --
> Dave Longley
> CTO
> Digital Bazaar, Inc.
>

Received on Wednesday, 31 March 2021 05:08:43 UTC