- From: Aidan Hogan <aidhog@gmail.com>
- Date: Fri, 11 Jun 2021 19:19:03 -0400
- To: semantic-web@w3.org
Hi David,

On 2021-06-11 17:14, David Booth wrote:
> On 6/11/21 9:30 AM, Eric Prud'hommeaux wrote:
>> On Fri, Jun 11, 2021 at 10:08:56AM +0100, Dan Brickley wrote:
>>> . . .
>>> Should protein databank files be RDFized before they fall in scope
>>> of this new WGs mission?
>>> https://en.wikipedia.org/wiki/Protein_Data_Bank_(file_format) - and
>>> if so, why?
>>
>> RDF Signatures are for signing RDF structures. Without such a
>> mechanism, you have to sign the syntax of an RDF document, which means
>> you have to keep it around, serve it preferentially whenever anyone
>> asks for a particular graph. That's a biggish ask of a quad store. It
>> would also involve inventing some protocol to say "please dig up the
>> original serialization" and probably some other convention. In the
>> end, it would be brittle and most folks would consider it a crappy hack.
>
> I think that is excellent justification (among other good reasons) for
> standardizing a canonical N-Quads format, and I fully support a W3C
> effort to do that. But I think at least one part of the confusion and
> concern is that the proposed canonicalization is framed as an *abstract*
> RDF Dataset canonicalization. I think that framing is causing two
> problems:
>
> 1. It creates the *perception* of a greatly increased attack surface,
> from a security standpoint, because it bundles the canonicalization
> algorithm with the cryptographic hash generation step, and claims to
> produce a hash of the *abstract* RDF Dataset. In reality, it does no
> such thing: the hash is computed on a concrete, canonicalized N-Quads
> serialization. But it is understandable that people would look at it
> and worry about what new security vulnerabilities it might create,
> given this framing.
>
> 2. It is misleading, because *any* RDF canonicalization algorithm is
> fundamentally about serialization -- *not* the abstract RDF Dataset.
> The proposed algorithm is really only abstract in the sense that it
> can be used with a family of serializations.

I'm not sure I agree on these points.

1) The hash produced is, to me, a hash of the abstract RDF dataset, modulo isomorphism: if you put two isomorphic abstract RDF datasets in (whatever their serialisation), you get the same hash out. The fact that N-Quads might be used is an implementation detail. The hash could just as well be produced over an abstract set-based representation of the quads and still provide the guarantees mentioned by the explainer (this is what my implementation does: it does not serialise to N-Quads and then hash that string, but rather builds the hash in-memory over the set of quads). I know the explainer specifically mentions N-Quads, and maybe that's what will happen, but it is not a necessary part of the implementation, nor a functional requirement of the algorithm.

2) It is true that the RDF standard states that "Blank node identifiers are not part of the RDF abstract syntax, but are entirely dependent on the concrete syntax or implementation." The standard also states that "the set of possible blank nodes is arbitrary", and to have a set of elements, we must be able to establish equality and inequality over those elements; otherwise the set is not defined. So how can we deal computationally with an arbitrary set of elements? We label blank nodes in a one-to-one manner as a proxy and work with the labels. (We could, equivalently, consider the set of labels to *be* the "arbitrary set of blank nodes" mentioned by the standard.) If the labels of two blank nodes are equal, the blank nodes are equal; if the labels are not equal, the blank nodes are not equal. We can use string libraries for this.
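To illustrate both points, here is a minimal sketch in Python (my own, not the proposed algorithm: the `canon` map below is a stand-in for whatever canonical labels a real canonicalisation algorithm would compute) of hashing a set of quads directly in memory, with blank node labels acting as the proxy for blank node identity:

    import hashlib

    def is_bnode(term):
        # Blank nodes are represented by their labels, e.g. "_:a";
        # label (in)equality is the proxy for blank node (in)equality.
        return term.startswith("_:")

    def hash_dataset(quads, canon):
        # `quads` is an in-memory set of 4-tuples of term strings;
        # `canon` maps each input blank node label to its canonical
        # label. Computing `canon` is the job of the canonicalisation
        # algorithm, which this sketch deliberately does not implement.
        relabelled = sorted(
            tuple(canon[t] if is_bnode(t) else t for t in quad)
            for quad in quads
        )
        h = hashlib.sha256()
        for quad in relabelled:
            for term in quad:
                h.update(term.encode("utf-8"))
                h.update(b"\x00")  # keep term boundaries unambiguous
        return h.hexdigest()

    # Two isomorphic datasets differing only in their blank node labels:
    d1 = {("_:a", "<http://ex.org/p>", "<http://ex.org/o>", "<http://ex.org/g>")}
    d2 = {("_:x", "<http://ex.org/p>", "<http://ex.org/o>", "<http://ex.org/g>")}

    # With a single blank node the canonical labelling is trivial; the
    # hashes agree, and no N-Quads document is ever serialised.
    assert hash_dataset(d1, {"_:a": "_:c14n0"}) == hash_dataset(d2, {"_:x": "_:c14n0"})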
Almost every implementation that works with (abstract) RDF datasets does this; it's not something particular to canonicalisation. So this is part of an implementation of the algorithm, and is completely compatible with what the standard suggests of abstract RDF datasets, and with the implementations that work with them.†

That said, I agree with you that the explainer could be made more precise in this regard (without affecting readability), in that it does refer to blank node labels as part of the dataset rather than the implementation. Here are the problematic sentences I can see:

* "Two isomorphic RDF Datasets are identical except for differences in how their individual blank nodes are labeled."

  This could be removed to leave just the sentence that follows it: "In particular, R is isomorphic with S if and only if it is possible to map the blank nodes of R to the blank nodes of S in a one-to-one manner, generating an RDF dataset R' such that R' = S." (This itself could be more formal, but I think that would be out of scope for the explainer; it is also defined elsewhere.)

* "Such a canonicalization function can be implemented, in practice, as a procedure that deterministically re-labels all blank nodes of an RDF Dataset in a one-to-one manner"

  We could change "re-labels" to simply "labels", to avoid the impression that the (abstract) RDF Dataset already has labels.

* "without depending on the particular set of blank node labels used in the input RDF Dataset"

  We could change this to "without depending on the particular set of blank node identifiers used in the serialization of the input RDF Dataset".

* "It could also be referred to as a “canonical relabelling scheme”"

  This could rather be "It could also be referred to as a “canonical labeling scheme”", again to avoid implying that labels already exist.

I have created a PR for this: https://github.com/w3c/lds-wg-charter/pull/91

If there's some other sentence that you see as imprecise in that way, let me/us know!

Best,
Aidan

† Well, there is perhaps one weird issue: if "the set of possible blank nodes is arbitrary", then it could also be an uncountably infinite set that cannot be "labelled" one-to-one by finite strings. The set of blank nodes could be the set of reals, for example. So when we talk about labelling blank nodes, we assume that they are countable (finite or countably infinite). I am not aware of anything in the RDF standard that restricts the set of blank nodes to be finite, countably infinite, or otherwise. Just to clarify: if this does turn out to be a problem, it will likely be a theoretical flaw for most standards built on top of RDF, and probably for the semantics of RDF graphs. I don't know if this actually matters in any meaningful sense, but I think it would be slightly comforting to know that somewhere, something normative says or implies that the set of blank nodes is countable. :)
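(To spell out the cardinality claim in the footnote, as a quick sketch in my own notation rather than anything from the standard: for any finite alphabet $\Sigma$ with at least one symbol,

    $|\Sigma^{*}| = \bigl|\bigcup_{n \ge 0} \Sigma^{n}\bigr| = \aleph_0$,

being a countably infinite union of finite sets; so an injective labelling $\lambda : B \to \Sigma^{*}$ can exist only if $|B| \le \aleph_0$, and with, e.g., $B = \mathbb{R}$, no one-to-one labelling by finite strings exists.)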
Received on Friday, 11 June 2021 23:20:05 UTC