- From: Jamie McCusker <mccusker@gmail.com>
- Date: Mon, 7 Jun 2021 22:03:04 -0400
- To: David Booth <david@dbooth.org>
- Cc: w3c semweb HCLS <public-semweb-lifesci@w3.org>
- Message-ID: <CAAtgn=SaDHR2zAbSVCT5M3dBcZ1hRN30PC+9UGXDHYfek9BqJQ@mail.gmail.com>
I really think that canonicalizing RDF graphs by sorting their
statements is a mistake. Obviously I'm biased towards the approach I
used in RGDA1 (see the implementation in RDFLib and the writeup in my
dissertation) -- that of Sayers and Karp, with Nauty-based
canonicalization of bnodes -- but the process does not need to, and
should not, involve sorting and serializing graphs in order to create
a digest for them.

Thanks,
Jamie

On Mon, Jun 7, 2021 at 5:42 PM David Booth <david@dbooth.org> wrote:
> Thank you for your work on this! I think RDF canonicalization is very
> important, and I also see the value in the proposed digital signatures
> work. But I have two immediate suggestions and one major question.
>
> 1. The proposed RDF Dataset Hash (RDH) algorithm talks about "sorting
> the N-Quads serialization of the canonical form [of the RDF Dataset]".
> Clearly the intent is to produce a canonical N-Quads serialization, in
> preparation for hashing. But at present the charter does not identify
> the Canonical N-Quads serialization algorithm as a named deliverable.
> It definitely should, so that it can be easily referenced and used in
> its own right.
>
> 2. In the Use Cases section of the Explainer, I suggest adding a
> diff/patch use case. I think it would be a huge missed opportunity if
> that use case were ignored in standardizing an RDF canonicalization
> algorithm. See further explanation below.
>
> 3. Although I see the value of an RDF-based digital signatures
> vocabulary, in reading the proposed charter and associated materials I
> have been unable to understand the value in *restricting* this
> vocabulary to source documents that happen to be RDF. Why not allow it
> to be used on *any* kind of digital source document? Cryptographic
> hash algorithms don't care what kind of source document their input
> bytes represent. Why should this digital signatures vocabulary care
> about the format or language of the source document? I can imagine a
> digital signatures vocabulary providing a way to formally state
> something like: "if user U signed digital contract C, then it means
> that U has agreed to the terms of contract C". But I do not yet see
> why it would need to say anything about the format or language of C.
> C just is whatever it is, whether it's English, RDF, or something
> else. Can someone enlighten me on this point?
>
> Those are my high-level comments and question. Further explanation
> about the diff/patch use case follows.
>
> -----------------------------------
>
> Diff/Patch Use Case:
> The key consideration that the diff/patch use case adds to
> canonicalization is that a "small" change to an RDF dataset should
> produce a commensurately "small" change in the canonicalized result
> (to the extent possible), at least for common use cases, such as:
> adding/deleting a few triples; adding/deleting an RDF molecule/object
> (or a concise bounded description,
> https://www.w3.org/Submission/2004/SUBM-CBD-20040930/, or similar);
> adding/deleting a graph from an RDF dataset; adding/deleting list
> elements; or adding/deleting a level of hierarchy in a tree (or
> tree-ish graph).
>
> This requirement is not important for digital signature use cases, but
> it is essential for diff/patch use cases. And to be clear, this
> requirement ONLY applies to the canonicalization algorithm -- NOT the
> hashing algorithm. Indeed, a cryptographic hashing algorithm must have
> exactly the opposite property: a small change in the input must produce
> a LARGE (random) change in the output.
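>
> To make that contrast concrete, here is a minimal sketch of both
> properties (a hypothetical illustration, assuming RDFLib's
> rdflib.compare module as the canonicalizer and SHA-256 as the hash;
> the graph data is made up):
>
>     import difflib
>     import hashlib
>     from rdflib import Graph
>     from rdflib.compare import to_canonical_graph
>
>     def canonical_lines(g):
>         # Relabel bnodes deterministically, then emit sorted
>         # N-Triples lines (a stand-in for canonical N-Quads).
>         # Note: serialize() returns a str in RDFLib 6+.
>         nt = to_canonical_graph(g).serialize(format="nt")
>         return sorted(line for line in nt.splitlines() if line.strip())
>
>     old, new = Graph(), Graph()
>     old.parse(data='_:a <http://example.org/p> "1" . '
>                    '_:a <http://example.org/q> "x" .', format="turtle")
>     new.parse(data='_:a <http://example.org/p> "2" . '
>                    '_:a <http://example.org/q> "x" .', format="turtle")
>
>     # Line-oriented diff of the two canonical forms. The requirement
>     # above is that this diff stays "small" for a small edit; whether
>     # a given canonicalization algorithm achieves that (bnode labels
>     # may shift when any triple changes) is exactly what needs testing.
>     for line in difflib.unified_diff(canonical_lines(old),
>                                      canonical_lines(new), lineterm=""):
>         print(line)
>
>     # The cryptographic hash, by contrast, must change completely
>     # (the avalanche effect), even for a one-character edit.
>     for g in (old, new):
>         text = "\n".join(canonical_lines(g))
>         print(hashlib.sha256(text.encode("utf-8")).hexdigest())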
>
> If the proposed canonicalization algorithms already meet this
> requirement, then that would be great. But I am not aware of any
> testing that has been done on them with this use case in mind, to
> find out whether they would meet this requirement. And I do think
> this requirement is important for a general-purpose canonicalization
> standard.
>
> Proposed text for use cases:
> [[
> Diff/patch of RDF Datasets
> For diff/patch applications that need to track changes to RDF
> datasets, or keep two RDF datasets in sync by applying incremental
> changes, the N-Quads Canonicalization Algorithm should make best
> efforts to produce results that are well suited for use with existing
> line-oriented diff and patch tools. This means that, given a "small"
> change to an RDF dataset -- i.e., changing only a "small" number of
> lines in an N-Quads representation -- the N-Quads Canonicalization
> Algorithm will most likely produce a commensurately "small" change in
> its canonicalized result, for common diff/patch use cases. Common use
> cases include: adding/deleting a few triples; adding/deleting an RDF
> molecule/object (or a concise bounded description,
> https://www.w3.org/Submission/2004/SUBM-CBD-20040930/, or similar);
> adding/deleting a graph from an RDF dataset; adding/deleting list
> elements; or adding/deleting a level of hierarchy in a tree (or
> tree-ish graph).
> Requirement: A Diff-Friendly N-Quads Canonicalization Algorithm
> ]]
>
> Thanks,
> David Booth
>
> On 4/6/21 6:20 AM, Ivan Herman wrote:
> > Dear all,
> >
> > the W3C has started to work on a Working Group charter for Linked
> > Data Signatures:
> >
> > https://w3c.github.io/lds-wg-charter/index.html
> >
> > The work proposed in this Working Group includes Linked Data
> > Canonicalization, as well as algorithms and vocabularies for
> > encoding digital proofs, such as digital signatures, and, with
> > that, securing information expressed in serializations such as
> > JSON-LD, TriG, and N-Quads.
> >
> > The need for Linked Data canonicalization, digests, or signatures
> > has been known for a very long time, but it is only in recent years
> > that research and development has resulted in mathematical
> > algorithms and related implementations that are at the maturity
> > level needed for a Web Standard. A separate explainer document:
> >
> > https://w3c.github.io/lds-wg-charter/explainer.html
> >
> > provides some background, as well as a small set of use cases.
> >
> > The W3C Credentials Community Group [1, 2] has been instrumental in
> > the work leading to this charter proposal, not least due to its
> > work on Verifiable Credentials and to recent applications and
> > development of, e.g., vaccination passports using those
> > technologies.
> >
> > It must be emphasized, however, that this work is not bound to a
> > specific application area or serialization. There are numerous use
> > cases in Linked Data, like the publication of biological and
> > pharmaceutical data or the consumption of mission-critical RDF
> > vocabularies, that depend on the ability to verify the authenticity
> > and integrity of the data being consumed. This Working Group aims
> > at covering all of those, and we hope to involve the Linked Data
> > community at large in the elaboration of the final charter
> > proposal.
> >
> > We welcome your general expressions of interest and support.
> > If you wish to make your comments public, please use GitHub issues:
> >
> > https://github.com/w3c/lds-wg-charter/issues
> >
> > A formal W3C Advisory Committee Review for this charter is expected
> > in about six weeks.
> >
> > [1] https://www.w3.org/community/credentials/
> > [2] https://w3c-ccg.github.io/
> >
> > ----
> > Ivan Herman, W3C
> > Home: http://www.w3.org/People/Ivan/
> > mobile: +33 6 52 46 00 43
> > ORCID ID: https://orcid.org/0000-0003-0782-2704

--
Jamie McCusker (they/she)
Director, Data Operations
Tetherless World Constellation
Rensselaer Polytechnic Institute
mccusj2@rpi.edu
http://tw.rpi.edu
Received on Tuesday, 8 June 2021 02:04:41 UTC