Re: Chartering work has started for a Linked Data Signature Working Group @W3C

On 6/7/21 10:03 PM, Jamie McCusker wrote:
> I really think that canonicalizing RDF graphs by sorting their 
> statements is a mistake. Obviously I'm biased towards the approach of 
> Sayers and Karp that I used in RGDA1 (see the implementation in RDFlib 
> and the writeup in my dissertation), with Nauty-based canonicalization 
> of bnodes, but the process does not need to and should not involve 
> sorting and serializing graphs in order to create a digest for them.

Why?  What downsides do you see, aside from the processing required to 
perform the serialization?

And if you are not serializing in order to create a digest, then 
presumably you would be using a custom digest algorithm, rather than a 
standard off-the-shelf digest algorithm that simply works on bytes.  If 
so, how do you justify that choice, given that home-grown security 
algorithms are notoriously insecure?
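
For concreteness, here is a minimal sketch of the kind of standard 
approach I have in mind, in plain Python: serialize the canonical form 
as N-Quads, sort the lines, and feed the resulting bytes to an 
off-the-shelf digest such as SHA-256.  (The canonical N-Quads lines 
below are made up for illustration; they stand in for the output of 
whatever canonicalization algorithm gets standardized.)

    import hashlib

    # Canonical N-Quads lines, standing in for the output of a
    # canonicalization algorithm that deterministically relabels bnodes
    # (illustrative data only).
    canonical_nquads = [
        '_:c14n0 <http://example.org/p> "b" .\n',
        '_:c14n0 <http://example.org/p> "a" .\n',
    ]

    # Sort the lines, concatenate, and hash the raw bytes with a
    # standard, off-the-shelf digest.  No custom cryptography needed.
    serialized = "".join(sorted(canonical_nquads)).encode("utf-8")
    print(hashlib.sha256(serialized).hexdigest())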

David Booth

> 
> Thanks,
> Jamie
> 
> On Mon, Jun 7, 2021 at 5:42 PM David Booth <david@dbooth.org> wrote:
> 
>     Thank you for your work on this!  I think RDF canonicalization is very
>     important, and I also see the value in the proposed digital signatures
>     work.  But I have two immediate suggestions and one major question.
> 
>     1. The proposed RDF Dataset Hash (RDH) algorithm talks about "sorting
>     the N-Quads serialization of the canonical form [of the RDF Dataset]".
>     Clearly the intent is to produce a canonical N-Quads serialization, in
>     preparation for hashing.  But at present the charter does not identify
>     the Canonical N-Quads serialization algorithm as a named deliverable.
>     It definitely should, so that it can be easily referenced and used in
>     its own right.
> 
>     2. In the Use Cases section of the Explainer, I suggest adding a
>     diff/patch use case.  I think it would be a huge missed opportunity if
>     that were ignored in standardizing an RDF canonicalization algorithm.
>     See further explanation below.
> 
>     3. Although I see the value of an RDF-based digital signatures
>     vocabulary, in reading the proposed charter and associated materials I
>     have been unable to understand the value in *restricting* this
>     vocabulary to source documents that happen to be RDF.  Why not allow it
>     to be used on *any* kind of digital source documents?  Cryptographic
>     hash algorithms don't care what kind of source document their input
>     bytes represent.  Why should this digital signatures vocabulary care
>     about the format or language of the source document?  I can imagine a
>     digital signatures vocabulary providing a way to formally state
>     something like: "if user U signed digital contract C, then it means
>     that U has agreed to the terms of contract C".  But I do not yet see
>     why it would need to say anything about the format or language of C.
>     C just is whatever it is, whether it's English, RDF or something
>     else.  Can someone enlighten me on this point?
> 
>     Those are my high level comments and question.  Further explanation
>     about the diff/patch use case follows.
> 
>                       -----------------------------------
> 
>     Diff/Patch Use Case:
>     The key consideration that the diff/patch use case adds to
>     canonicalization is that a "small" change to an RDF dataset should
>     produce a commensurately "small" change in the canonicalized result (to
>     the extent possible), at least for common use cases, such as
>     adding/deleting a few triples, adding/deleting an RDF molecule/object
>     (or a concise bounded description
>     https://www.w3.org/Submission/2004/SUBM-CBD-20040930/ or similar),
>     adding/deleting a graph from an RDF dataset, adding/deleting list
>     elements, or adding/deleting a level of hierarchy in a tree (or
>     tree-ish graph).
> 
>     This requirement is not important for digital signature use cases, but
>     it is essential for diff/patch use cases.  And to be clear, this
>     requirement ONLY applies to the canonicalization algorithm -- NOT the
>     hashing algorithm.  Indeed, a cryptographic hashing algorithm must have
>     exactly the opposite property: a small change in the input must produce
>     a LARGE (random) change in the output.
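> 
>     To illustrate the contrast with a rough sketch (made-up data and
>     plain Python, not the proposed algorithms): a line-oriented diff of
>     two canonical N-Quads serializations that differ by one added triple
>     should ideally touch only one line, whereas a cryptographic hash of
>     those same serializations changes completely:
> 
>         import difflib, hashlib
> 
>         # Two hypothetical canonical N-Quads serializations; the second
>         # adds a single triple to the first (illustrative data only).
>         before = [
>             '<http://example.org/s> <http://example.org/p> "a" .\n',
>             '<http://example.org/s> <http://example.org/p> "b" .\n',
>         ]
>         after = before + [
>             '<http://example.org/s> <http://example.org/p> "c" .\n',
>         ]
> 
>         # A diff-friendly canonicalization keeps this change to one line:
>         print("".join(difflib.unified_diff(before, after)))
> 
>         # A cryptographic hash of the two serializations, by design,
>         # differs completely even for this one-line change:
>         print(hashlib.sha256("".join(before).encode()).hexdigest())
>         print(hashlib.sha256("".join(after).encode()).hexdigest())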
> 
>     If the proposed canonicalization algorithms already meet this
>     requirement, then that would be great.  But I am not aware of any
>     testing that has been done on them with this use case in mind, to find
>     out whether they would meet this requirement.  And I do think this
>     requirement is important for a general purpose canonicalization
>     standard.
> 
>     Proposed text for use cases:
>     [[
>     Diff/patch of RDF Datasets
>     For diff/patch applications that need to track changes to RDF datasets,
>     or keep two RDF datasets in sync by applying incremental changes, the
>     N-Quads Canonicalization algorithm should make best efforts to produce
>     results that are well suited for use with existing line-oriented diff
>     and patch tools.  This means that, given a "small" change to an RDF
>     dataset -- i.e., changing only a "small" number of lines in an N-Quads
>     representation -- the N-Quads Canonicalization Algorithm will most
>     likely produce a commensurately "small" change in its canonicalized
>     result, for common diff/patch use cases.  Common use cases include
>     adding/deleting a few triples, adding/deleting an RDF molecule/object
>     (or a concise bounded description
>     https://www.w3.org/Submission/2004/SUBM-CBD-20040930/ or similar),
>     adding/deleting a graph from an RDF dataset, adding/deleting list
>     elements, or adding/deleting a level of hierarchy in a tree (or
>     tree-ish graph).
>     Requirement: A Diff-Friendly N-Quads Canonicalization Algorithm
>     ]]
> 
>     Thanks,
>     David Booth
> 
>     On 4/6/21 6:20 AM, Ivan Herman wrote:
>      > Dear all,
>      >
>      > the W3C has started to work on a Working Group charter for Linked
>      > Data Signatures:
>      >
>      > https://w3c.github.io/lds-wg-charter/index.html
>      >
>      > The work proposed in this Working Group includes Linked Data
>      > Canonicalization, as well as algorithms and vocabularies for
>      > encoding digital proofs, such as digital signatures, and, with
>      > that, securing information expressed in serializations such as
>      > JSON-LD, TriG, and N-Quads.
>      >
>      > The need for Linked Data canonicalization, digest, or signature
>      > has been known for a very long time, but it is only in recent
>      > years that research and development have resulted in mathematical
>      > algorithms and related implementations that are at the maturity
>      > level needed for a Web Standard. A separate explainer document:
>      >
>      > https://w3c.github.io/lds-wg-charter/explainer.html
>      >
>      > provides some background, as well as a small set of use cases.
>      >
>      > The W3C Credentials Community Group[1,2] has been instrumental in
>      > the work leading to this charter proposal, not least due to its
>      > work on Verifiable Credentials and to recent applications and
>      > development of, e.g., vaccination passports using those
>      > technologies.
>      >
>      > It must be emphasized, however, that this work is not bound to a
>      > specific application area or serialization. There are numerous use
>      > cases in Linked Data, like the publication of biological and
>      > pharmaceutical data, consumption of mission-critical RDF
>      > vocabularies, and others, that depend on the ability to verify the
>      > authenticity and integrity of the data being consumed. This
>      > Working Group aims to cover all of those, and we hope to involve
>      > the Linked Data Community at large in the elaboration of the final
>      > charter proposal.
>      >
>      > We welcome your general expressions of interest and support. If
>      > you wish to make your comments public, please use GitHub issues:
>      >
>      > https://github.com/w3c/lds-wg-charter/issues
>      >
>      > A formal W3C Advisory Committee Review for this charter is
>      > expected in about six weeks.
>      >
>      > [1] https://www.w3.org/community/credentials/
>      > [2] https://w3c-ccg.github.io/
>      >
>      >
>      > ----
>      > Ivan Herman, W3C
>      > Home: http://www.w3.org/People/Ivan/
>      > mobile: +33 6 52 46 00 43
>      > ORCID ID: https://orcid.org/0000-0003-0782-2704
>      >
> 
> 
> 
> -- 
> Jamie McCusker (they/she)
> 
> Director, Data Operations
> Tetherless World Constellation
> Rensselaer Polytechnic Institute
> mccusj2@rpi.edu
> http://tw.rpi.edu

Received on Tuesday, 8 June 2021 03:01:27 UTC