- From: David Booth <david@dbooth.org>
- Date: Mon, 7 Jun 2021 17:35:22 -0400
- To: w3c semweb HCLS <public-semweb-lifesci@w3.org>
Thank you for your work on this! I think RDF canonicalization is very important, and I also see the value in the proposed digital signatures work. But I have two immediate suggestions and one major question.

1. The proposed RDF Dataset Hash (RDH) algorithm talks about "sorting the N-Quads serialization of the canonical form [of the RDF Dataset]". Clearly the intent is to produce a canonical N-Quads serialization, in preparation for hashing. But at present the charter does not identify the Canonical N-Quads serialization algorithm as a named deliverable. It definitely should, so that it can be easily referenced and used in its own right.

2. In the Use Cases section of the Explainer, I suggest adding a diff/patch use case. I think it would be a huge missed opportunity if that were ignored in standardizing an RDF canonicalization algorithm. See further explanation below.

3. Although I see the value of an RDF-based digital signatures vocabulary, in reading the proposed charter and associated materials I have been unable to understand the value in *restricting* this vocabulary to source documents that happen to be RDF. Why not allow it to be used on *any* kind of digital source document? Cryptographic hash algorithms don't care what kind of source document their input bytes represent. Why should this digital signatures vocabulary care about the format or language of the source document?

I can imagine a digital signatures vocabulary providing a way to formally state something like: "if user U signed digital contract C, then it means that U has agreed to the terms of contract C". But I do not yet see why it would need to say anything about the format or language of C. C just is whatever it is, whether it's English, RDF, or something else. Can someone enlighten me on this point?

Those are my high-level comments and question. Further explanation about the diff/patch use case follows.

-----------------------------------
Diff/Patch Use Case:

The key consideration that the diff/patch use case adds to canonicalization is that a "small" change to an RDF dataset should produce a commensurately "small" change in the canonicalized result (to the extent possible), at least for common use cases, such as adding/deleting a few triples, adding/deleting an RDF molecule/object (or a concise bounded description https://www.w3.org/Submission/2004/SUBM-CBD-20040930/ or similar), adding/deleting a graph from an RDF dataset, adding/deleting list elements, or adding/deleting a level of hierarchy in a tree (or tree-ish graph).

This requirement is not important for digital signature use cases, but it is essential for diff/patch use cases. And to be clear, this requirement applies ONLY to the canonicalization algorithm -- NOT the hashing algorithm. Indeed, a cryptographic hashing algorithm must have exactly the opposite property: a small change in the input must produce a LARGE (random) change in the output.

If the proposed canonicalization algorithms already meet this requirement, then that would be great. But I am not aware of any testing that has been done on them with this use case in mind, to find out whether they would meet it. And I do think this requirement is important for a general-purpose canonicalization standard.
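To make the requirement concrete, here is a minimal sketch (in Python, with illustrative example.org IRIs of my own choosing) of how diff-friendly canonical N-Quads would interact with ordinary line-oriented diff tools. Note that this toy example deliberately contains no blank nodes; blank node labeling is the hard case, since an algorithm that renumbers blank node labels globally could let a one-triple insertion ripple through many lines.

```python
# A minimal sketch, assuming a canonicalization step has already produced
# canonical N-Quads (one quad per line). The IRIs are illustrative only,
# and no blank nodes are involved -- blank node relabeling is the hard case.
import difflib

old_quads = sorted([
    '<http://example.org/s> <http://example.org/p> "a" <http://example.org/g> .',
    '<http://example.org/s> <http://example.org/p> "b" <http://example.org/g> .',
    '<http://example.org/s> <http://example.org/p> "c" <http://example.org/g> .',
])

# Add one quad. A diff-friendly canonicalization should leave every other
# line byte-identical, so a line-oriented diff stays commensurately small.
new_quads = sorted(old_quads + [
    '<http://example.org/s> <http://example.org/p> "d" <http://example.org/g> .',
])

print("\n".join(difflib.unified_diff(old_quads, new_quads, lineterm="")))
# Expected: a single "+" line plus diff context, i.e. a "small" change.
```

Whether the proposed canonicalization algorithms preserve this property once blank nodes are involved is exactly what I think would need testing.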
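For contrast with the hashing side, here is my reading of the step quoted in point 1 -- sort the N-Quads serialization of the canonical form, then hash. This is only a sketch, not the actual RDH algorithm, and the choice of SHA-256 is my assumption for illustration.

```python
# A sketch of "sorting the N-Quads serialization of the canonical form"
# followed by hashing, per my reading of the proposed RDH algorithm.
# SHA-256 is my assumption here, chosen only for illustration.
import hashlib

def dataset_hash(canonical_nquads_lines):
    """Hash a dataset given its canonical N-Quads lines (illustrative)."""
    doc = "\n".join(sorted(canonical_nquads_lines)) + "\n"
    return hashlib.sha256(doc.encode("utf-8")).hexdigest()

quads = ['<http://example.org/s> <http://example.org/p> "a" .']
print(dataset_hash(quads))
print(dataset_hash(quads + ['<http://example.org/s> <http://example.org/p> "b" .']))
# The two digests differ completely: the avalanche property we want from
# hashing is the opposite of the stability we want for diff/patch.
```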
Proposed text for use cases:

[[
Diff/patch of RDF Datasets

For diff/patch applications that need to track changes to RDF datasets, or keep two RDF datasets in sync by applying incremental changes, the N-Quads Canonicalization Algorithm should make best efforts to produce results that are well suited for use with existing line-oriented diff and patch tools. This means that, given a "small" change to an RDF dataset -- i.e., changing only a "small" number of lines in an N-Quads representation -- the N-Quads Canonicalization Algorithm will most likely produce a commensurately "small" change in its canonicalized result, for common diff/patch use cases. Common use cases include adding/deleting a few triples, adding/deleting an RDF molecule/object (or a concise bounded description https://www.w3.org/Submission/2004/SUBM-CBD-20040930/ or similar), adding/deleting a graph from an RDF dataset, adding/deleting list elements, or adding/deleting a level of hierarchy in a tree (or tree-ish graph).

Requirement: A Diff-Friendly N-Quads Canonicalization Algorithm
]]

Thanks,
David Booth

On 4/6/21 6:20 AM, Ivan Herman wrote:
> Dear all,
>
> The W3C has started to work on a Working Group charter for Linked Data
> Signatures:
>
> https://w3c.github.io/lds-wg-charter/index.html
>
> The work proposed in this Working Group includes Linked Data
> Canonicalization, as well as algorithms and vocabularies for encoding
> digital proofs, such as digital signatures, and, with that, securing
> information expressed in serializations such as JSON-LD, TriG, and N-Quads.
>
> The need for Linked Data canonicalization, digests, or signatures has
> been known for a very long time, but it is only in recent years that
> research and development have resulted in mathematical algorithms and
> related implementations mature enough for a Web Standard. A separate
> explainer document:
>
> https://w3c.github.io/lds-wg-charter/explainer.html
>
> provides some background, as well as a small set of use cases.
>
> The W3C Credentials Community Group [1,2] has been instrumental in the
> work leading to this charter proposal, not least due to its work on
> Verifiable Credentials and recent applications and developments, e.g.,
> vaccination passports using those technologies.
>
> It must be emphasized, however, that this work is not bound to a
> specific application area or serialization. There are numerous use cases
> in Linked Data, such as the publication of biological and pharmaceutical
> data, the consumption of mission-critical RDF vocabularies, and others,
> that depend on the ability to verify the authenticity and integrity of
> the data being consumed. This Working Group aims to cover all of those,
> and we hope to involve the Linked Data community at large in the
> elaboration of the final charter proposal.
>
> We welcome your general expressions of interest and support. If you wish
> to make your comments public, please use GitHub issues:
>
> https://github.com/w3c/lds-wg-charter/issues
>
> A formal W3C Advisory Committee Review for this charter is expected in
> about six weeks.
>
> [1] https://www.w3.org/community/credentials/
> [2] https://w3c-ccg.github.io/
>
> ----
> Ivan Herman, W3C
> Home: http://www.w3.org/People/Ivan/
> mobile: +33 6 52 46 00 43
> ORCID ID: https://orcid.org/0000-0003-0782-2704
Received on Monday, 7 June 2021 21:36:27 UTC