Re: Thoughts on the LDS WG chartering discussion from David Booth on 2021-06-12 (semantic-web@w3.org from June 2021)

From: David Booth <david@dbooth.org>
Date: Fri, 11 Jun 2021 23:39:40 -0400
To: semantic-web@w3.org
Message-ID: <e16c8fe4-4a5e-592e-0394-f650ea524d80@dbooth.org>
Hi Aidan,

On 6/11/21 7:19 PM, Aidan Hogan wrote:
> . . . 
> 1) The hash produced is, to me, a hash of the abstract RDF dataset, 
> modulo isomorphism. If you put two isomorphic abstract RDF datasets in 
> (whatever their serialisation), you get the same hash out. The fact that 
> N-Quads might be used is an implementation detail. The hash could just 
> as well be produced over an abstract set-based representation of the 
> quads and still provide the guarantees mentioned by the explainer (which 
> is what my implementation does; it does not serialise to N-Quads and 
> then hash that string, but builds the hash in-memory over the set of 
> quads). I know the explainer specifically mentions N-Quads, and maybe 
> that's what will happen, but it is not a necessary part of the 
> implementation nor a functional requirement of the algorithm.

I can see your point, from a strictly theoretical perspective.  But from 
a practical perspective the serialization to N-Quads is *crucial*, 
because it enables a standard, off-the-shelf cryptographic hashing 
function to be used instead of a home-grown hashing function that was 
invented just for RDF.

> 2) It is true that the RDF standard states that "Blank node identifiers 
> are not part of the RDF abstract syntax, but are entirely dependent on 
> the concrete syntax or implementation." The standard also states that 
> "the set of possible blank nodes is arbitrary", and to have a set of 
> elements, we must be able to establish equality and inequality over 
> those elements; otherwise the set is not defined. So how can we deal 
> computationally with an arbitrary set of elements? We label blank nodes 
> in a one-to-one manner as a proxy and work with the labels. (We could, 
> equivalently, consider the set of labels to *be* the "arbitrary set of 
> blank nodes" mentioned by the standard.) 

That's an interesting point!  You're right.  It hadn't occurred to me 
that the labels themselves could be considered the "arbitrary" set of 
blank nodes.

But again, I see that as a purely theoretical point, because from a 
practical perspective, in order to do anything useful with a 
canonicalized RDF dataset -- such as storing it, transmitting it, 
comparing it to another canonicalized RDF dataset, or computing its hash 
-- then you either need to serialize it, or you need to create 
RDF-specific tools for those operations, which would pretty much defeat 
the purpose of doing the canonicalization in the first place.  After 
all, canonicalization is not an end goal, it is merely a means to enable 
other operations to be performed more easily or without requiring 
RDF-specific tools.

For example, to my mind a key purpose in standardizing a canonical form 
of N-Quads (aside from digital signatures) is *specifically* to enable 
*standard* diff and patch tools to work on RDF.

Thanks,
David Booth

> If the labels of two blank 
> nodes are equal, the blank nodes are equal. If the labels are not equal, 
> the blank nodes are not equal. We can use string libraries for this. 
> Almost every implementation that works with (abstract) RDF datasets does 
> this; it's not something particular to canonicalisation. So this is part 
> of an implementation of the algorithm and is completely compatible with 
> what the standard suggests of abstract RDF datasets, and those 
> implementations that work with them.†
> 
> 
> 
> 
> That said, I agree with you that the explainer could be made more 
> precise in this manner (without affecting readability) in that it does 
> refer to blank node labels as part of the dataset rather than the 
> implementation. Here are the problematic sentences I can see:
> 
> * "Two isomorphic RDF Datasets are identical except for differences in 
> how their individual blank nodes are labeled." This could be removed to 
> just leave the sentence that follows it: "In particular, R is isomorphic 
> with S if and only if it is possible to map the blank nodes of R to the 
> blank nodes of S in a one-to-one manner, generating an RDF dataset R' 
> such that R' = S." (This itself could be more formal, but I think that 
> would be out of scope for the explainer; also it is defined elsewhere.)
> 
> * "Such a canonicalization function can be implemented, in practice, as 
> a procedure that deterministically re-labels all blank nodes of an RDF 
> Dataset in a one-to-one manner" We could change "re-labels" to simply 
> "labels" to avoid the impression that the (abstract) RDF Dataset already 
> has labels.
> 
> * "without depending on the particular set of blank node labels used in 
> the input RDF Dataset" We could change to "without depending on the 
> particular set of blank node identifiers used in the serialization of 
> the input RDF Dataset"
> 
> * "It could also be referred to as a “canonical relabelling scheme”" 
> This could rather be "It could also be referred to as a “canonical 
> labeling scheme”", to avoid implying that labels already exist.
> 
> I have created a PR for this:
>      https://github.com/w3c/lds-wg-charter/pull/91
> 
> If there's some other sentence that you see as imprecise in that way, 
> let me/us know!
> 
> Best,
> Aidan
> 
> † Well, there is a perhaps one weird issue that if "the set of possible 
> blank nodes is arbitrary", then it could also be a uncountably infinite 
> set that cannot be "labelled" one-to-one by finite strings. The set of 
> blank nodes could be the set of reals, for example. So when we talk 
> about labelling blank nodes, we assume that they are countable (finite 
> or countably infinite). I am not aware that anything in the RDF standard 
> restricts the set of blank nodes to be finite, countably infinite, or 
> something else. Just to clarify that if this does turn out to be a 
> problem, it will likely be a theoretical flaw for most standards built 
> on top of RDF, and probably for the semantics of RDF graphs. I don't 
> know if this actually matters in any meaningful sense, but I think it 
> would be slightly comforting to know that somewhere, something normative 
> says or implies that the set of blank nodes is countable. :)
>
Received on Saturday, 12 June 2021 03:40:54 UTC