Re: Chartering work has started for a Linked Data Signature Working Group @W3C from Aidan Hogan on 2021-05-20 (semantic-web@w3.org from May 2021)

From: Aidan Hogan <aidhog@gmail.com>
Date: Thu, 20 May 2021 18:58:56 -0400
To: semantic-web@w3.org
Message-ID: <8d4b9469-a10d-312c-fb14-59a2d8515e76@gmail.com>
Jumping into the conversation a little late ... (apologies) ... but 
wanted to reply to some of the issues raised.


Given that various documents and topics have been referenced and 
discussed, it might be useful to recap what I understand to be the 
*core* of the proposed work so we're on the same page (this is just my 
personal understanding and might not be correct):

(1) RDF Canonicalisation
Input: An RDF dataset
Output: A canonical RDF dataset modulo isomorphism

(2.1) Digital Signature: Signing
Input: A set of bytes, a private key
Output: A signature for the bytes

(2.2) Digital Signature: Verification
Input: A set of bytes, a signature, a (paired) public key
Output: true if signature corresponds to bytes and key pair
 (false otherwise)

(3) Additional elements involving embedded signature and verification 
metadata, proof purpose, domains, etc.


Regarding (1), this is certainly doable with two caveats:
C1: known algorithms can take exponential time for certain graphs; 
though those cases are "exotic" and unlikely to occur in practice, small 
graphs that cannot be canonicalised in reasonable time could be 
constructed easily enough (as part of an attack? I'm not sure);
C2: care needs to be taken for hash collisions, though this can be 
managed with workarounds.

A more operational challenge here is that we would likely have to 
standardise an *algorithm*.


Regarding (2), I believe this to be pretty vanilla cryptography, e.g.:

 https://en.wikipedia.org/wiki/Digital_signature#Definition

The result of (1) can be trivially serialised as a canonical set of 
bytes such that if two RDF datasets are isomorphic, they produce the 
same bytes; and if the bytes are the same, the input RDF datasets were 
isomorphic. Plug (1) into (2) and you have a digital signature framework 
for RDF datasets (modulo isomorphism).


Regarding (3), I don't really have a good handle on the details.


The new ingredient here is (1).

I proposed an algorithm for (1) [see A] (defined for RDF graphs, but it 
can be trivially extended for RDF datasets), which I am pretty confident 
about (modulo C1 and C2). It has been peer-reviewed twice (conference 
and journal paper, but possibly with overlapping reviewers) and tested 
on lots of examples. The algorithm is based heavily on an algorithm for 
graph isomorphism that's been around for decades. The general proof 
argument is relatively straightforward.

Dave, Rachel and Manu also have an algorithm for (1) [see B]. The 
initial algorithm was proposed before mine. Dave and Manu initially 
developed it, Rachel and Dave later formalised it and provided a proof, 
I provided some input on the formalisation and the proof, and a more 
detailed proof was more recently provided by Mirabolic Consulting here:

 https://lists.w3.org/Archives/Public/public-credentials/2021Apr/att-0032/Mirabolic_Graph_Iso_Report_2020_10_19.pdf

I have looked into the report, and as someone who has worked on the 
topic, I find it to be very thorough and competent. I think there is a 
small caveat, however: the algorithm is complex, and so the proof is 
more complex, and so verifying the proof is even more complex. This is 
not a question of expertise or understanding: it is not typical to have 
to prove such a statement. The part that personally gives me slight 
pause is "that the order in which such nodes are given canonical 
identifiers does not matter". This becomes a very complex question in 
the context of the algorithm, and in turn the proof. Ideally I think it 
would be good to get some more eyeballs on that part. (In case there is 
an issue, it would only affect exceedingly exotic cases, and I think 
that finding a solution would not be difficult. The fact that it works 
for known counter-examples is promising, for sure.)

On a side note, I think that the "Correctness by Design" section of the 
Mirabolic Consulting review linked above is very astute. It is easier to 
start with the general conditions of correctness and design the 
algorithm around them, rather than to propose an (in this case, complex) 
algorithm and later try to prove correctness for it.


Regarding (3), I am not a security expert, so I'm not in a position to 
say much. @Manu, you mentioned that:

"""
That said, the most important aspects of these documents have undergone
multiple reviews over the past several years by trained mathematicians
analysing the formal proofs as well as security and cryptography 
engineers, as stated in the explainer:
"""

RDF Dataset canonicalisation has indeed undergone review by trained 
mathematicians as mentioned before, but to the best of my knowledge, the 
people involved (those findable from the explainer) are not security or 
cryptography experts. Which security and cryptography engineers have 
reviewed which parts? It would be good to see input from such experts 
regarding (2) and particularly (3).


Regarding RDF vs. Linked Data ... I tend to agree with Dan that the 
standards should talk about RDF when they are talking about RDF and not 
conflate RDF with Linked Data. Linked Data is well-defined, and it's not 
the same as RDF. To me, it would be (somewhat) akin to saying that HTML 
is the same as the Web, or that the Web is not well-defined and so we 
are okay with using the terms HTML and Web interchangeably. This 
conflation is best avoided, I think. Following that analogy, what we 
have is akin to canonicalising/signing a HTML document (an RDF dataset), 
which is not the same as canonicalising/signing the Web (Linked Data). 
(And yep, I know this analogy is a bit problematic, but I think it is 
useful as an informal way to illustrate my thoughts on the issue, rather 
than as some specific technical claim.)

There are pragmatic aspects that I don't really understand in terms of 
how RDF is viewed vs. Linked Data that may imply it is better overall to 
favour the latter term even though the former is more precise. Perhaps a 
useful semantic distinction is the following: we are not proposing 
canonicalisation/signing *of* Linked Data, but rather we are proposing 
canonicalisation/signing *for* Linked Data (*of* RDF datasets). This way 
we can perhaps still speak of "Linked Data Signatures" (more accurately 
"Signatures *for* Linked Data") but without having to conflate RDF and 
Linked Data (i.e., interpreting it as "Signatures *of* RDF *for* Linked 
Data" ... we are signing RDF for Linked Data settings/applications).


Regarding Dan's concerns about the meaning of an RDF dataset changing 
based on changes in remote documents on the Web (e.g., the vocabularies 
used), there are two related issues lurking under that overcoat:

1. The first issue relates to the context of the RDF dataset in a Linked 
Data setting, and more specifically, resolving its "dependencies", such 
as the vocabularies it uses, etc. There are ways to address this, but I 
feel that it's out of scope (which we should make clear), and perhaps 
relates to the issue of clarifying whether or not we are signing RDF 
datasets, or signing Linked Data (as discussed previously).

2. The second issue relates to the semantics of RDF datasets (e.g., 
under some entailment regime). This is a much more difficult issue. It's 
not completely hopeless, and there are related works on this (e.g., I 
proposed an algorithm for canonicalising RDF graphs modulo simple 
semantics [A], i.e., lean/non-lean graphs, and there are techniques that 
one could consider for RDFS and some profiles of OWL perhaps), but I 
think the consensus is (rightly) that this is out-of-scope. What we are 
discussing here is canonicalising and signing an RDF dataset, not what 
the dataset means or what it says. There is no *interpretation* at play 
here, like how SPARQL returns IRIs and literals rather than people and 
dates. Anything that hints at canonicalising or signing what the RDF 
dataset *says* or *claims* or *means* within this charter would not be 
appropriate, I think. Better to talk about canonicalising/signing RDF 
datasets themselves (or maybe what they *contain*).


Best,
Aidan

[A] http://aidanhogan.com/docs/rdf-canonicalisation.pdf
[B] 
https://lists.w3.org/Archives/Public/public-credentials/2021Mar/att-0220/RDFDatasetCanonicalization-2020-10-09.pdf 


On 2021-05-19 11:10, Manu Sporny wrote:
> On 5/13/21 10:15 AM, Peter F. Patel-Schneider wrote:
>> My understanding is that each and every aspect of a proposal involving
>> security is important, down to the smallest details.
> 
> Yes, correct.
> 
>> So it isn't just that parts of the methods described in
>> https://w3c-ccg.github.io/ld-proofs/ are considered to be secure, each and
>> every part of the methods have to have been shown to be secure.
> 
> Yes, correct.
> 
>> It is the case that the entirely of the methods in
>> https://w3c-ccg.github.io/ld-proofs/ have been shown to be secure?
> 
> Keep in mind that everyone will have a different definition of "shown to be
> secure" -- mathematicians and cryptographers require a set of assumptions to
> provide any sort of answer and the answer is usually in the form of "given
> assumptions A-G, with parameters Q-Z applied, the solution is expected to
> require H compute-years on J hardware to have a K probability of a
> collision/solution being successfully found." ... and you'll note that the
> statement doesn't say anything about "shown to be secure".
> 
> All that to say, people that don't spend years in cryptography and mathematics
> think that there are clear answers to questions around security when there are
> typically an array of assumptions and parameters that go along with questions
> and answers related to the notion of "secure".
> 
> All that to say -- a growing number of people that have been working on this
> stuff for many years now thinks it's "secure", we're using "secure"
> cryptographic primitives and parameters, we're bundling the messages in ways
> that we believe are "secure" and doesn't reduce the levels of security
> provided by the cryptographic primitives, and we have independent mathematical
> proofs on the generalized solveability of the RDF Dataset Canonicalization
> problem.
> 
> Has it been shown to be secure? The proposers of the LDS WG Charter think so,
> and there are papers written on the topic, otherwise we wouldn't be proposing
> a LDS WG Charter. While that's true, we also want the sort of broad review
> that a W3C WG can provide... we've done as much as we can outside a WG, we now
> want more eyes on it to increase the probability of finding issues if there
> are any.
> 
>> Where are the implementations of the methods in
>> https://w3c-ccg.github.io/ld-proofs/?  I'm willing to test them out.
> 
> Spec is here:
> 
> https://json-ld.github.io/rdf-dataset-canonicalization/
> 
> Test suite is here:
> 
> https://json-ld.github.io/rdf-dataset-canonicalization/tests/index.html
> 
> There are several independent implementations that run against the test suite
> here:
> 
> JS - https://github.com/digitalbazaar/rdf-canonize
> C++ - https://github.com/digitalbazaar/rdf-canonize-native
> Java - https://github.com/setl/rdf-urdna
> Go - https://github.com/piprate/json-gold
> Python - https://github.com/digitalbazaar/pyld
> Rust - https://github.com/digitalbazaar/rdf-canonize-rs
> Ruby - https://github.com/ruby-rdf/rdf-normalize
> 
>> In particular I'm looking for implementations that take a document encoding
>> an RDF dataset and produce a document signing the dataset and
>> implementations that take a document encoding a signed RDF dataset and
>> verify the signing.
> 
> There are libraries that do this at a higher layer (for Verifiable
> Credentials) here:
> 
> https://github.com/digitalbazaar/vc-js
> https://github.com/spruceid/didkit
> https://github.com/danubetech/verifiable-credentials-java
> 
> ... and so on.
> 
> Hopefully that's enough to get you started.
> 
> -- manu
>
Received on Thursday, 20 May 2021 22:59:12 UTC