- From: Aidan Hogan <aidhog@gmail.com>
- Date: Fri, 11 Jun 2021 19:19:03 -0400
- To: semantic-web@w3.org
Hi David,

On 2021-06-11 17:14, David Booth wrote:
> On 6/11/21 9:30 AM, Eric Prud'hommeaux wrote:
>> On Fri, Jun 11, 2021 at 10:08:56AM +0100, Dan Brickley wrote:
>>> . . .
>>> Should protein databank files be RDFized before they fall in scope
>>> of this new WGs mission?
>>> https://en.wikipedia.org/wiki/Protein_Data_Bank_(file_format) - and
>>> if so, why?
>>
>> RDF Signatures are for signing RDF structures. Without such a
>> mechanism, you have to sign the syntax of an RDF document, which means
>> you have to keep it around, serve it preferentially whenever anyone
>> asks for a particular graph. That's a biggish ask of a quad store. It
>> would also involve inventing some protocol to say "please dig up the
>> original serialization" and probably some other convention. In the
>> end, it would be brittle and most folks would consider it a crappy hack.
>
> I think that is excellent justification (among other good reasons) for
> standardizing a canonical N-Quads format, and I fully support a W3C
> effort to do that. But I think at least one part of the confusion and
> concern is that the proposed canonicalization is framed as an *abstract*
> RDF Dataset canonicalization. I think that framing is causing two
> problems:
>
> 1. It creates the *perception* of a greatly increased attack surface,
> from a security standpoint, because it bundles the canonicalization
> algorithm with the cryptographic hash generation step, and claims to
> produce a hash of the *abstract* RDF Dataset. In reality, it does no
> such thing: the hash is computed on a concrete, canonicalized N-Quads
> serialization. But it is understandable that people would look at it
> and worry about what new security vulnerabilities it might create,
> given this framing.
>
> 2. It is misleading, because *any* RDF canonicalization algorithm is
> fundamentally about serialization -- *not* the abstract RDF Dataset.
> The proposed algorithm is really only abstract in the sense that it
> can be used with a family of serializations.

I'm not sure I agree on these points.

1) The hash produced is, to me, a hash of the abstract RDF dataset, modulo isomorphism: if you put two isomorphic abstract RDF datasets in (whatever their serialisation), you get the same hash out. The fact that N-Quads might be used is an implementation detail. The hash could just as well be produced over an abstract set-based representation of the quads and still provide the guarantees mentioned by the explainer (this is what my implementation does: it does not serialise to N-Quads and then hash that string, but rather builds the hash in-memory over the set of quads). I know the explainer specifically mentions N-Quads, and maybe that's what will happen, but it is not a necessary part of the implementation, nor a functional requirement of the algorithm.

2) It is true that the RDF standard states that "Blank node identifiers are not part of the RDF abstract syntax, but are entirely dependent on the concrete syntax or implementation." The standard also states that "the set of possible blank nodes is arbitrary", and to have a set of elements, we must be able to establish equality and inequality over those elements; otherwise the set is not defined. So how can we deal computationally with an arbitrary set of elements? We label blank nodes in a one-to-one manner as a proxy and work with the labels. (We could, equivalently, consider the set of labels to *be* the "arbitrary set of blank nodes" mentioned by the standard.) If the labels of two blank nodes are equal, the blank nodes are equal; if the labels are not equal, the blank nodes are not equal. We can use string libraries for this.
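To illustrate both points, here is a minimal sketch in Python (my own, not the proposed algorithm: the `canon` map below is a stand-in for whatever canonical labels a real canonicalisation algorithm would compute) of hashing a set of quads directly in memory, with blank node labels acting as the proxy for blank node identity:

    import hashlib

    def is_bnode(term):
        # Blank nodes are represented by their labels, e.g. "_:a";
        # label (in)equality is the proxy for blank node (in)equality.
        return term.startswith("_:")

    def hash_dataset(quads, canon):
        # `quads` is an in-memory set of 4-tuples of term strings;
        # `canon` maps each input blank node label to its canonical
        # label. Computing `canon` is the job of the canonicalisation
        # algorithm, which this sketch deliberately does not implement.
        relabelled = sorted(
            tuple(canon[t] if is_bnode(t) else t for t in quad)
            for quad in quads
        )
        h = hashlib.sha256()
        for quad in relabelled:
            for term in quad:
                h.update(term.encode("utf-8"))
                h.update(b"\x00")  # keep term boundaries unambiguous
        return h.hexdigest()

    # Two isomorphic datasets differing only in their blank node labels:
    d1 = {("_:a", "<http://ex.org/p>", "<http://ex.org/o>", "<http://ex.org/g>")}
    d2 = {("_:x", "<http://ex.org/p>", "<http://ex.org/o>", "<http://ex.org/g>")}

    # With a single blank node the canonical labelling is trivial; the
    # hashes agree, and no N-Quads document is ever serialised.
    assert hash_dataset(d1, {"_:a": "_:c14n0"}) == hash_dataset(d2, {"_:x": "_:c14n0"})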
Almost every implementation that works with (abstract) RDF datasets does this; it's not something particular to canonicalisation. So this is part of an implementation of the algorithm, and is completely compatible with what the standard suggests of abstract RDF datasets, and with the implementations that work with them.†

That said, I agree with you that the explainer could be made more precise in this regard (without affecting readability), in that it does refer to blank node labels as part of the dataset rather than the implementation. Here are the problematic sentences I can see:

* "Two isomorphic RDF Datasets are identical except for differences in how their individual blank nodes are labeled."

  This could be removed to leave just the sentence that follows it: "In particular, R is isomorphic with S if and only if it is possible to map the blank nodes of R to the blank nodes of S in a one-to-one manner, generating an RDF dataset R' such that R' = S." (This itself could be more formal, but I think that would be out of scope for the explainer; it is also defined elsewhere.)

* "Such a canonicalization function can be implemented, in practice, as a procedure that deterministically re-labels all blank nodes of an RDF Dataset in a one-to-one manner"

  We could change "re-labels" to simply "labels", to avoid the impression that the (abstract) RDF Dataset already has labels.

* "without depending on the particular set of blank node labels used in the input RDF Dataset"

  We could change this to "without depending on the particular set of blank node identifiers used in the serialization of the input RDF Dataset".

* "It could also be referred to as a “canonical relabelling scheme”"

  This could rather be "It could also be referred to as a “canonical labeling scheme”", again to avoid implying that labels already exist.

I have created a PR for this: https://github.com/w3c/lds-wg-charter/pull/91

If there's some other sentence that you see as imprecise in that way, let me/us know!

Best,
Aidan

† Well, there is perhaps one weird issue: if "the set of possible blank nodes is arbitrary", then it could also be an uncountably infinite set that cannot be "labelled" one-to-one by finite strings. The set of blank nodes could be the set of reals, for example. So when we talk about labelling blank nodes, we assume that they are countable (finite or countably infinite). I am not aware of anything in the RDF standard that restricts the set of blank nodes to be finite, countably infinite, or otherwise. Just to clarify: if this does turn out to be a problem, it will likely be a theoretical flaw for most standards built on top of RDF, and probably for the semantics of RDF graphs. I don't know if this actually matters in any meaningful sense, but I think it would be slightly comforting to know that somewhere, something normative says or implies that the set of blank nodes is countable. :)
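(To spell out the cardinality claim in the footnote, as a quick sketch in my own notation rather than anything from the standard: for any finite alphabet $\Sigma$ with at least one symbol,

    $|\Sigma^{*}| = \bigl|\bigcup_{n \ge 0} \Sigma^{n}\bigr| = \aleph_0$,

being a countably infinite union of finite sets; so an injective labelling $\lambda : B \to \Sigma^{*}$ can exist only if $|B| \le \aleph_0$, and with, e.g., $B = \mathbb{R}$, no one-to-one labelling by finite strings exists.)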
Received on Friday, 11 June 2021 23:20:05 UTC