Re: Thoughts on the LDS WG chartering discussion from Eric Prud'hommeaux on 2021-06-12 (semantic-web@w3.org from June 2021)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Sat, 12 Jun 2021 17:34:54 +0200
To: David Booth <david@dbooth.org>
Cc: semantic-web@w3.org
Message-ID: <20210612153454.GD8464@w3.org>
On Fri, Jun 11, 2021 at 11:39:40PM -0400, David Booth wrote:
> Hi Aidan,
> 
> On 6/11/21 7:19 PM, Aidan Hogan wrote:
> > . . . 1) The hash produced is, to me, a hash of the abstract RDF
> > dataset, modulo isomorphism. If you put two isomorphic abstract RDF
> > datasets in (whatever their serialisation), you get the same hash out.
> > The fact that N-Quads might be used is an implementation detail. The
> > hash could just as well be produced over an abstract set-based
> > representation of the quads and still provide the guarantees mentioned
> > by the explainer (which is what my implementation does; it does not
> > serialise to N-Quads and then hash that string, but builds the hash
> > in-memory over the set of quads). I know the explainer specifically
> > mentions N-Quads, and maybe that's what will happen, but it is not a
> > necessary part of the implementation nor a functional requirement of the
> > algorithm.
> 
> I can see your point, from a strictly theoretical perspective.  But from a
> practical perspective the serialization to N-Quads is *crucial*, because it
> enables a standard, off-the-shelf cryptographic hashing function to be used
> instead of a home-grown hashing function that was invented just for RDF.

One could implement URDNA without ever serializing N-Quads with a
hashCode() function on Quad, BNode, IRI and Literal:

[[
Hash Quad.urdna (h: Hasher) {
  return h.add(
    g.urdna(h), h(" "),
    s.urdna(h), h(" "),
    p.urdna(h), h(" "),
    o.urdna(h), " .\n"
  );
}

Hash IRI.urdna (h: Hasher) {
  return h.add('<', value, '>');
}

Hash BNode.urdna (h: Hasher) {
  static int ordinal = 1;
  return h.add("_:c14n_" + ordinal++);
}
]]

Literal elided, and you'd want conditional code around Quad.g, but you
get the idea. Note that this approach means you never need to
duplicate your graph in memory as you do when constructing a
potentially giant N-Quads string.

From a geek perspective, this abstract hashing function looks a little
arbitrary 'cause you could get just as strong a hash if you eliminated
all of the string constants (" ", "<", ">", ...). But keeping them is
cheap and allows you to serialize the result in a widely-supported RDF
syntax. RDF C14N, XML C14N and JWS's C14N all canonicalize to
serializable data, which makese it easier to implement and makes
user's lives way easier 'cause they can printf the input to the
hashing function.

I believe that describing hashing as a function over N-Quads suggests
that folks need to serialize as N-Quads to use it. I think the wording
describing it as function over an (abstract) RDF Graph (Dataset,
whatever) to say exactly what I as an RDF user want to know.


> > On 2021-06-11 17:14, David Booth wrote
> > > 2. It is misleading, because *any* RDF canonicalization algorithm is 
> > > fundamentally about serialization -- *not* the abstract RDF Dataset. The 
> > > proposed algorithm is really only abstract in the sense that it can be 
> > > used with a family of serializations.

I think the above shows that it's not fundamentally about a
serialization, more "compatible with" serialization. It doesn't
undermine your diff/patch utility below, but it does convey that it's
a function over a Dataset, not a string of unicode characters.


> > 2) It is true that the RDF standard states that "Blank node identifiers
> > are not part of the RDF abstract syntax, but are entirely dependent on
> > the concrete syntax or implementation." The standard also states that
> > "the set of possible blank nodes is arbitrary", and to have a set of
> > elements, we must be able to establish equality and inequality over
> > those elements; otherwise the set is not defined. So how can we deal
> > computationally with an arbitrary set of elements? We label blank nodes
> > in a one-to-one manner as a proxy and work with the labels. (We could,
> > equivalently, consider the set of labels to *be* the "arbitrary set of
> > blank nodes" mentioned by the standard.)
> 
> That's an interesting point!  You're right.  It hadn't occurred to me that
> the labels themselves could be considered the "arbitrary" set of blank
> nodes.
> 
> But again, I see that as a purely theoretical point, because from a
> practical perspective, in order to do anything useful with a canonicalized
> RDF dataset -- such as storing it, transmitting it, comparing it to another
> canonicalized RDF dataset, or computing its hash -- then you either need to
> serialize it, or you need to create RDF-specific tools for those operations,
> which would pretty much defeat the purpose of doing the canonicalization in
> the first place.  After all, canonicalization is not an end goal, it is
> merely a means to enable other operations to be performed more easily or
> without requiring RDF-specific tools.
> 
> For example, to my mind a key purpose in standardizing a canonical form of
> N-Quads (aside from digital signatures) is *specifically* to enable
> *standard* diff and patch tools to work on RDF.
> 
> Thanks,
> David Booth
> 
> > If the labels of two blank nodes are equal, the blank nodes are equal.
> > If the labels are not equal, the blank nodes are not equal. We can use
> > string libraries for this. Almost every implementation that works with
> > (abstract) RDF datasets does this; it's not something particular to
> > canonicalisation. So this is part of an implementation of the algorithm
> > and is completely compatible with what the standard suggests of abstract
> > RDF datasets, and those implementations that work with them.†
> > 
> > 
> > 
> > 
> > That said, I agree with you that the explainer could be made more
> > precise in this manner (without affecting readability) in that it does
> > refer to blank node labels as part of the dataset rather than the
> > implementation. Here are the problematic sentences I can see:
> > 
> > * "Two isomorphic RDF Datasets are identical except for differences in
> > how their individual blank nodes are labeled." This could be removed to
> > just leave the sentence that follows it: "In particular, R is isomorphic
> > with S if and only if it is possible to map the blank nodes of R to the
> > blank nodes of S in a one-to-one manner, generating an RDF dataset R'
> > such that R' = S." (This itself could be more formal, but I think that
> > would be out of scope for the explainer; also it is defined elsewhere.)
> > 
> > * "Such a canonicalization function can be implemented, in practice, as
> > a procedure that deterministically re-labels all blank nodes of an RDF
> > Dataset in a one-to-one manner" We could change "re-labels" to simply
> > "labels" to avoid the impression that the (abstract) RDF Dataset already
> > has labels.
> > 
> > * "without depending on the particular set of blank node labels used in
> > the input RDF Dataset" We could change to "without depending on the
> > particular set of blank node identifiers used in the serialization of
> > the input RDF Dataset"
> > 
> > * "It could also be referred to as a “canonical relabelling scheme”"
> > This could rather be "It could also be referred to as a “canonical
> > labeling scheme”", to avoid implying that labels already exist.
> > 
> > I have created a PR for this:
> >      https://github.com/w3c/lds-wg-charter/pull/91
> > 
> > If there's some other sentence that you see as imprecise in that way,
> > let me/us know!
> > 
> > Best,
> > Aidan
> > 
> > † Well, there is a perhaps one weird issue that if "the set of possible
> > blank nodes is arbitrary", then it could also be a uncountably infinite
> > set that cannot be "labelled" one-to-one by finite strings. The set of
> > blank nodes could be the set of reals, for example. So when we talk
> > about labelling blank nodes, we assume that they are countable (finite
> > or countably infinite). I am not aware that anything in the RDF standard
> > restricts the set of blank nodes to be finite, countably infinite, or
> > something else. Just to clarify that if this does turn out to be a
> > problem, it will likely be a theoretical flaw for most standards built
> > on top of RDF, and probably for the semantics of RDF graphs. I don't
> > know if this actually matters in any meaningful sense, but I think it
> > would be slightly comforting to know that somewhere, something normative
> > says or implies that the set of blank nodes is countable. :)
> > 
>
Received on Saturday, 12 June 2021 15:35:14 UTC