Re: Chartering work has started for a Linked Data Signature Working Group @W3C from Ivan Herman on 2021-05-21 (semantic-web@w3.org from May 2021)

From: Ivan Herman <ivan@w3.org>
Date: Fri, 21 May 2021 12:07:42 +0200
To: Aidan Hogan <aidhog@gmail.com>
Cc: semantic-web <semantic-web@w3.org>, Manu Sporny <msporny@digitalbazaar.com>, Dave Longley <dlongley@digitalbazaar.com>, Markus Sabadello <markus@danubetech.com>, Phil Archer <phil.archer@gs1.org>
Message-Id: <E0EE234F-815F-4B53-80DD-DFDBC10D74B1@w3.org>
This is just an 'admin' reply, to make sure that the prospective co-chairs of the proposed WG, as well as some other interested persons receive this mail though they are not necessarily on the semantic web mailing list.

Ivan

> On 21 May 2021, at 00:58, Aidan Hogan <aidhog@gmail.com> wrote:
> 
> Jumping into the conversation a little late ... (apologies) ... but wanted to reply to some of the issues raised.
> 
> 
> Given that various documents and topics have been referenced and discussed, it might be useful to recap what I understand to be the *core* of the proposed work so we're on the same page (this is just my personal understanding and might not be correct):
> 
> (1) RDF Canonicalisation
> Input: An RDF dataset
> Output: A canonical RDF dataset modulo isomorphism
> 
> (2.1) Digital Signature: Signing
> Input: A set of bytes, a private key
> Output: A signature for the bytes
> 
> (2.2) Digital Signature: Verification
> Input: A set of bytes, a signature, a (paired) public key
> Output: true if signature corresponds to bytes and key pair
> 	(false otherwise)
> 
> (3) Additional elements involving embedded signature and verification metadata, proof purpose, domains, etc.
> 
> 
> Regarding (1), this is certainly doable with two caveats:
> C1: known algorithms can take exponential time for certain graphs; though those cases are "exotic" and unlikely to occur in practice, small graphs that cannot be canonicalised in reasonable time could be constructed easily enough (as part of an attack? I'm not sure);
> C2: care needs to be taken for hash collisions, though this can be managed with workarounds.
> 
> A more operational challenge here is that we would likely have to standardise an *algorithm*.
> 
> 
> Regarding (2), I believe this to be pretty vanilla cryptography, e.g.:
> 
> 	https://en.wikipedia.org/wiki/Digital_signature#Definition
> 
> The result of (1) can be trivially serialised as a canonical set of bytes such that if two RDF datasets are isomorphic, they produce the same bytes; and if the bytes are the same, the input RDF datasets were isomorphic. Plug (1) into (2) and you have a digital signature framework for RDF datasets (modulo isomorphism).
> 
> 
> Regarding (3), I don't really have a good handle on the details.
> 
> 
> The new ingredient here is (1).
> 
> I proposed an algorithm for (1) [see A] (defined for RDF graphs, but it can be trivially extended for RDF datasets), which I am pretty confident about (modulo C1 and C2). It has been peer-reviewed twice (conference and journal paper, but possibly with overlapping reviewers) and tested on lots of examples. The algorithm is based heavily on an algorithm for graph isomorphism that's been around for decades. The general proof argument is relatively straightforward.
> 
> Dave, Rachel and Manu also have an algorithm for (1) [see B]. The initial algorithm was proposed before mine. Dave and Manu initially developed it, Rachel and Dave later formalised it and provided a proof, I provided some input on the formalisation and the proof, and a more detailed proof was more recently provided by Mirabolic Consulting here:
> 
> 	https://lists.w3.org/Archives/Public/public-credentials/2021Apr/att-0032/Mirabolic_Graph_Iso_Report_2020_10_19.pdf
> 
> I have looked into the report, and as someone who has worked on the topic, I find it to be very thorough and competent. I think there is a small caveat, however: the algorithm is complex, and so the proof is more complex, and so verifying the proof is even more complex. This is not a question of expertise or understanding: it is not typical to have to prove such a statement. The part that personally gives me slight pause is "that the order in which such nodes are given canonical identifiers does not matter". This becomes a very complex question in the context of the algorithm, and in turn the proof. Ideally I think it would be good to get some more eyeballs on that part. (In case there is an issue, it would only affect exceedingly exotic cases, and I think that finding a solution would not be difficult. The fact that it works for known counter-examples is promising, for sure.)
> 
> On a side note, I think that the "Correctness by Design" section of the Mirabolic Consulting review linked above is very astute. It is easier to start with the general conditions of correctness and design the algorithm around them, rather than to propose an (in this case, complex) algorithm and later try to prove correctness for it.
> 
> 
> Regarding (3), I am not a security expert, so I'm not in a position to say much. @Manu, you mentioned that:
> 
> """
> That said, the most important aspects of these documents have undergone
> multiple reviews over the past several years by trained mathematicians
> analysing the formal proofs as well as security and cryptography engineers, as stated in the explainer:
> """
> 
> RDF Dataset canonicalisation has indeed undergone review by trained mathematicians as mentioned before, but to the best of my knowledge, the people involved (those findable from the explainer) are not security or cryptography experts. Which security and cryptography engineers have reviewed which parts? It would be good to see input from such experts regarding (2) and particularly (3).
> 
> 
> Regarding RDF vs. Linked Data ... I tend to agree with Dan that the standards should talk about RDF when they are talking about RDF and not conflate RDF with Linked Data. Linked Data is well-defined, and it's not the same as RDF. To me, it would be (somewhat) akin to saying that HTML is the same as the Web, or that the Web is not well-defined and so we are okay with using the terms HTML and Web interchangeably. This conflation is best avoided, I think. Following that analogy, what we have is akin to canonicalising/signing a HTML document (an RDF dataset), which is not the same as canonicalising/signing the Web (Linked Data). (And yep, I know this analogy is a bit problematic, but I think it is useful as an informal way to illustrate my thoughts on the issue, rather than as some specific technical claim.)
> 
> There are pragmatic aspects that I don't really understand in terms of how RDF is viewed vs. Linked Data that may imply it is better overall to favour the latter term even though the former is more precise. Perhaps a useful semantic distinction is the following: we are not proposing canonicalisation/signing *of* Linked Data, but rather we are proposing canonicalisation/signing *for* Linked Data (*of* RDF datasets). This way we can perhaps still speak of "Linked Data Signatures" (more accurately "Signatures *for* Linked Data") but without having to conflate RDF and Linked Data (i.e., interpreting it as "Signatures *of* RDF *for* Linked Data" ... we are signing RDF for Linked Data settings/applications).
> 
> 
> Regarding Dan's concerns about the meaning of an RDF dataset changing based on changes in remote documents on the Web (e.g., the vocabularies used), there are two related issues lurking under that overcoat:
> 
> 1. The first issue relates to the context of the RDF dataset in a Linked Data setting, and more specifically, resolving its "dependencies", such as the vocabularies it uses, etc. There are ways to address this, but I feel that it's out of scope (which we should make clear), and perhaps relates to the issue of clarifying whether or not we are signing RDF datasets, or signing Linked Data (as discussed previously).
> 
> 2. The second issue relates to the semantics of RDF datasets (e.g., under some entailment regime). This is a much more difficult issue. It's not completely hopeless, and there are related works on this (e.g., I proposed an algorithm for canonicalising RDF graphs modulo simple semantics [A], i.e., lean/non-lean graphs, and there are techniques that one could consider for RDFS and some profiles of OWL perhaps), but I think the consensus is (rightly) that this is out-of-scope. What we are discussing here is canonicalising and signing an RDF dataset, not what the dataset means or what it says. There is no *interpretation* at play here, like how SPARQL returns IRIs and literals rather than people and dates. Anything that hints at canonicalising or signing what the RDF dataset *says* or *claims* or *means* within this charter would not be appropriate, I think. Better to talk about canonicalising/signing RDF datasets themselves (or maybe what they *contain*).
> 
> 
> Best,
> Aidan
> 
> [A] http://aidanhogan.com/docs/rdf-canonicalisation.pdf
> [B] https://lists.w3.org/Archives/Public/public-credentials/2021Mar/att-0220/RDFDatasetCanonicalization-2020-10-09.pdf 
> 
> On 2021-05-19 11:10, Manu Sporny wrote:
>> On 5/13/21 10:15 AM, Peter F. Patel-Schneider wrote:
>>> My understanding is that each and every aspect of a proposal involving
>>> security is important, down to the smallest details.
>> Yes, correct.
>>> So it isn't just that parts of the methods described in
>>> https://w3c-ccg.github.io/ld-proofs/ are considered to be secure, each and
>>> every part of the methods have to have been shown to be secure.
>> Yes, correct.
>>> It is the case that the entirely of the methods in
>>> https://w3c-ccg.github.io/ld-proofs/ have been shown to be secure?
>> Keep in mind that everyone will have a different definition of "shown to be
>> secure" -- mathematicians and cryptographers require a set of assumptions to
>> provide any sort of answer and the answer is usually in the form of "given
>> assumptions A-G, with parameters Q-Z applied, the solution is expected to
>> require H compute-years on J hardware to have a K probability of a
>> collision/solution being successfully found." ... and you'll note that the
>> statement doesn't say anything about "shown to be secure".
>> All that to say, people that don't spend years in cryptography and mathematics
>> think that there are clear answers to questions around security when there are
>> typically an array of assumptions and parameters that go along with questions
>> and answers related to the notion of "secure".
>> All that to say -- a growing number of people that have been working on this
>> stuff for many years now thinks it's "secure", we're using "secure"
>> cryptographic primitives and parameters, we're bundling the messages in ways
>> that we believe are "secure" and doesn't reduce the levels of security
>> provided by the cryptographic primitives, and we have independent mathematical
>> proofs on the generalized solveability of the RDF Dataset Canonicalization
>> problem.
>> Has it been shown to be secure? The proposers of the LDS WG Charter think so,
>> and there are papers written on the topic, otherwise we wouldn't be proposing
>> a LDS WG Charter. While that's true, we also want the sort of broad review
>> that a W3C WG can provide... we've done as much as we can outside a WG, we now
>> want more eyes on it to increase the probability of finding issues if there
>> are any.
>>> Where are the implementations of the methods in
>>> https://w3c-ccg.github.io/ld-proofs/?  I'm willing to test them out.
>> Spec is here:
>> https://json-ld.github.io/rdf-dataset-canonicalization/
>> Test suite is here:
>> https://json-ld.github.io/rdf-dataset-canonicalization/tests/index.html
>> There are several independent implementations that run against the test suite
>> here:
>> JS - https://github.com/digitalbazaar/rdf-canonize
>> C++ - https://github.com/digitalbazaar/rdf-canonize-native
>> Java - https://github.com/setl/rdf-urdna
>> Go - https://github.com/piprate/json-gold
>> Python - https://github.com/digitalbazaar/pyld
>> Rust - https://github.com/digitalbazaar/rdf-canonize-rs
>> Ruby - https://github.com/ruby-rdf/rdf-normalize
>>> In particular I'm looking for implementations that take a document encoding
>>> an RDF dataset and produce a document signing the dataset and
>>> implementations that take a document encoding a signed RDF dataset and
>>> verify the signing.
>> There are libraries that do this at a higher layer (for Verifiable
>> Credentials) here:
>> https://github.com/digitalbazaar/vc-js
>> https://github.com/spruceid/didkit
>> https://github.com/danubetech/verifiable-credentials-java
>> ... and so on.
>> Hopefully that's enough to get you started.
>> -- manu
> 


----
Ivan Herman, W3C 
Home: http://www.w3.org/People/Ivan/
mobile: +33 6 52 46 00 43
ORCID ID: https://orcid.org/0000-0003-0782-2704
Received on Friday, 21 May 2021 10:07:53 UTC