Re: Chartering work has started for a Linked Data Signature Working Group @W3C

For one, Sayers and Karp build on off-the-shelf algorithms by applying an
aggregation function over statement-wise content digests. RGDA1 uses SUM,
but PRODUCT works almost as well (though it needs an overflow limit
defined). The main reason is scalability: why sort and serialize a
billion-statement graph when it can simply be iterated over in memory?
Sorting and serializing not only fails to scale, it also rules out
streaming. Distributed (disjoint) graphs can likewise be digested in
place, and the resulting digests can in turn be aggregated into a final
result.
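
As a minimal sketch of the aggregation idea (not RGDA1 itself; the
SHA-256 statement hash and the 2**256 modulus below are placeholder
choices, and the Nauty-based blank node canonicalization step is
omitted entirely):

    import hashlib
    from typing import Iterable

    # A fixed modulus keeps the aggregate digest at a constant width.
    MODULUS = 2 ** 256

    def statement_digest(stmt: str) -> int:
        """Digest a single canonicalized statement (e.g., one N-Quads line)."""
        return int.from_bytes(hashlib.sha256(stmt.encode("utf-8")).digest(), "big")

    def graph_digest(statements: Iterable[str]) -> int:
        """SUM-aggregate per-statement digests. The sum is order-independent
        and streamable, so the graph is never sorted or held in memory."""
        total = 0
        for stmt in statements:
            total = (total + statement_digest(stmt)) % MODULUS
        return total

    def combine(*digests: int) -> int:
        """Aggregate the digests of disjoint subgraphs into a final digest."""
        return sum(digests) % MODULUS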

Additionally, databases can keep a running digest of their contents by
adding and subtracting the digests of the relevant statements as those
statements are inserted and removed. That's pretty much impossible with a
canonicalized-serialization approach. The same is true of streaming
SPARQL.
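
For example, reusing statement_digest and MODULUS from the sketch above
(again just an illustration, not the RDFlib implementation), a store
could maintain its digest incrementally:

    class RunningDigest:
        """Maintain a graph digest incrementally as statements come and go."""

        def __init__(self) -> None:
            self.value = 0

        def add(self, stmt: str) -> None:
            self.value = (self.value + statement_digest(stmt)) % MODULUS

        def remove(self, stmt: str) -> None:
            # Subtraction mod MODULUS exactly undoes an earlier add().
            self.value = (self.value - statement_digest(stmt)) % MODULUS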

With this approach, the space requirement is constant and the time
requirement is linear, instead of linear (in-memory?) space and
O(n log n) time. But maybe we should consult a content digest expert on
this, if you're concerned about the security of such uses of content
digests?

Thanks,
Jamie

On Mon, Jun 7, 2021 at 11:07 PM David Booth <david@dbooth.org> wrote:

> On 6/7/21 10:03 PM, Jamie McCusker wrote:
> > I really think that canonicalizing RDF graphs by sorting their
> > statements is a mistake. Obviously I'm biased towards the approach I
> > used in RGDA1 (see implementation in RDFlib and writeup in my
> > dissertation) of Sayers and Karp with Nauty-based canonicalization of
> > bnodes, but the process does not need to and should not involve sorting
> > and serializing graphs in order to create a digest for them.
>
> Why?  What downsides do you see, aside from the processing required to
> perform the serialization?
>
> And if you are not serializing in order to create a digest, then
> presumably you would be using a custom digest algorithm, rather than a
> standard off-the-shelf digest algorithm that simply works on bytes.  If
> so, how do you justify that choice, given that home grown security
> algorithms are notoriously insecure?
>
> David Booth
>
> >
> > Thanks,
> > Jamie
> >
> > On Mon, Jun 7, 2021 at 5:42 PM David Booth <david@dbooth.org> wrote:
> >
> >     Thank you for your work on this!  I think RDF canonicalization is
> >     very important, and I also see the value in the proposed digital
> >     signatures work.  But I have two immediate suggestions and one major
> >     question.
> >
> >     1. The proposed RDF Dataset Hash (RDH) algorithm talks about "sorting
> >     the N-Quads serialization of the canonical form [of the RDF Dataset]".
> >     Clearly the intent is to produce a canonical N-Quads serialization, in
> >     preparation for hashing.  But at present the charter does not identify
> >     the Canonical N-Quads serialization algorithm as a named deliverable.
> >     It definitely should, so that it can be easily referenced and used in
> >     its own right.
> >
> >     2. In the Use Cases section of the Explainer, I suggest adding a
> >     diff/patch use case.  I think it would be a huge missed opportunity if
> >     that were ignored in standardizing an RDF canonicalization algorithm.
> >     See further explanation below.
> >
> >     3. Although I see the value of an RDF-based digital signatures
> >     vocabulary, in reading the proposed charter and associated materials
> >     I have been unable to understand the value in *restricting* this
> >     vocabulary to source documents that happen to be RDF.  Why not allow
> >     it to be used on *any* kind of digital source documents?
> >     Cryptographic hash algorithms don't care what kind of source document
> >     their input bytes represent.  Why should this digital signatures
> >     vocabulary care about the format or language of the source document?
> >     I can imagine a digital signatures vocabulary providing a way to
> >     formally state something like: "if user U signed digital contract C,
> >     then it means that U has agreed to the terms of contract C".  But I
> >     do not yet see why it would need to say anything about the format or
> >     language of C.  C just is whatever it is, whether it's English, RDF
> >     or something else.  Can someone enlighten me on this point?
> >
> >     Those are my high level comments and question.  Further explanation
> >     about the diff/patch use case follows.
> >
> >                       -----------------------------------
> >
> >     Diff/Patch Use Case:
> >     The key consideration that the diff/patch use case adds to
> >     canonicalization is that a "small" change to an RDF dataset should
> >     produce a commensurately "small" change in the canonicalized result
> >     (to the extent possible), at least for common use cases, such as
> >     adding/deleting a few triples, adding/deleting an RDF molecule/object
> >     (or a concise bounded description
> >     https://www.w3.org/Submission/2004/SUBM-CBD-20040930/ or similar),
> >     adding/deleting a graph from an RDF dataset, adding/deleting list
> >     elements, or adding/deleting a level of hierarchy in a tree (or
> >     tree-ish graph).
> >
> >     This requirement is not important for digital signature use cases,
> >     but it is essential for diff/patch use cases.  And to be clear, this
> >     requirement ONLY applies to the canonicalization algorithm -- NOT the
> >     hashing algorithm.  Indeed, a cryptographic hashing algorithm must
> >     have exactly the opposite property: a small change in the input must
> >     produce a LARGE (random) change in the output.
> >
> >     If the proposed canonicalization algorithms already meet this
> >     requirement, then that would be great.  But I am not aware of any
> >     testing that has been done on them with this use case in mind, to
> >     find out whether they would meet this requirement.  And I do think
> >     this requirement is important for a general purpose canonicalization
> >     standard.
> >
> >     Proposed text for use cases:
> >     [[
> >     Diff/patch of RDF Datasets
> >     For diff/patch applications that need to track changes to RDF
> >     datasets, or keep two RDF datasets in sync by applying incremental
> >     changes, the N-Quads Canonicalization algorithm should make best
> >     efforts to produce results that are well suited for use with existing
> >     line-oriented diff and patch tools.  This means that, given a "small"
> >     change to an RDF dataset -- i.e., changing only a "small" number of
> >     lines in an N-Quads representation -- the N-Quads Canonicalization
> >     Algorithm will most likely produce a commensurately "small" change in
> >     its canonicalized result, for common diff/patch use cases.  Common
> >     use cases include adding/deleting a few triples, adding/deleting an
> >     RDF molecule/object (or a concise bounded description
> >     https://www.w3.org/Submission/2004/SUBM-CBD-20040930/ or similar),
> >     adding/deleting a graph from an RDF dataset, adding/deleting list
> >     elements, or adding/deleting a level of hierarchy in a tree (or
> >     tree-ish graph)
> >     Requirement: A Diff-Friendly N-Quads Canonicalization Algorithm
> >     ]]
> >
> >     Thanks,
> >     David Booth
> >
> >     On 4/6/21 6:20 AM, Ivan Herman wrote:
> >      > Dear all,
> >      >
> >      > the W3C has started to work on a Working Group charter for
> >      > Linked Data Signatures:
> >      >
> >      > https://w3c.github.io/lds-wg-charter/index.html
> >      >
> >      > The work proposed in this Working Group includes Linked Data
> >      > Canonicalization, as well as algorithms and vocabularies for
> >      > encoding digital proofs, such as digital signatures, and with
> >      > that secure information expressed in serializations such as
> >      > JSON-LD, TriG, and N-Quads.
> >      >
> >      > The need for Linked Data canonicalization, digest, or signature
> >      > has been known for a very long time, but it is only in recent
> >      > years that research and development has resulted in mathematical
> >      > algorithms and related implementations that are on the maturity
> >      > level for a Web Standard. A separate explainer document:
> >      >
> >      > https://w3c.github.io/lds-wg-charter/explainer.html
> >      >
> >      > provides some background, as well as a small set of use cases.
> >      >
> >      > The W3C Credentials Community Group[1,2] has been instrumental
> >      > in the work leading to this charter proposal, not the least due
> >      > to its work on Verifiable Credentials and with recent
> >      > applications and development on, e.g., vaccination passports
> >      > using those technologies.
> >      >
> >      > It must be emphasized, however, that this work is not bound to a
> >      > specific application area or serialization. There are numerous
> >      > use cases in Linked Data, like the publication of biological and
> >      > pharmaceutical data, consumption of mission critical RDF
> >      > vocabularies, and others, that depend on the ability to verify
> >      > the authenticity and integrity of the data being consumed. This
> >      > Working Group aims at covering all those, and we hope to involve
> >      > the Linked Data Community at large in the elaboration of the
> >      > final charter proposal.
> >      >
> >      > We welcome your general expressions of interest and support. If
> >      > you wish to make your comments public, please use GitHub issues:
> >      >
> >      > https://github.com/w3c/lds-wg-charter/issues
> >      >
> >      > A formal W3C Advisory Committee Review for this charter is
> >      > expected in about six weeks.
> >      >
> >      > [1] https://www.w3.org/community/credentials/
> >      > [2] https://w3c-ccg.github.io/
> >      >
> >      >
> >      > ----
> >      > Ivan Herman, W3C
> >      > Home: http://www.w3.org/People/Ivan/
> >      > mobile: +33 6 52 46 00 43
> >      > ORCID ID: https://orcid.org/0000-0003-0782-2704
> >      >
> >
> >
> >
> > --
> > Jamie McCusker (they/she)
> >
> > Director, Data Operations
> > Tetherless World Constellation
> > Rensselaer Polytechnic Institute
> > mccusj2@rpi.edu
> > http://tw.rpi.edu
>
> --
Jamie McCusker (they/she)

Director, Data Operations
Tetherless World Constellation
Rensselaer Polytechnic Institute
mccusj2@rpi.edu
http://tw.rpi.edu

Received on Tuesday, 8 June 2021 04:23:09 UTC