Re: Thoughts on the LDS WG chartering discussion from Eric Prud'hommeaux on 2021-06-13 (semantic-web@w3.org from June 2021)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Sun, 13 Jun 2021 10:59:48 +0200
To: David Booth <david@dbooth.org>
Cc: semantic-web@w3.org
Message-ID: <20210613085948.GH8464@w3.org>
On Sat, Jun 12, 2021 at 10:10:35PM -0400, David Booth wrote:
> On 6/12/21 11:34 AM, Eric Prud'hommeaux wrote:
> > . . .
> > One could implement URDNA without ever serializing N-Quads with a
> > hashCode() function on Quad, BNode, IRI and Literal: . . . >
> > [[
> > Hash Quad.urdna (h: Hasher) {
> >    return h.add(
> >      g.urdna(h), h(" "),
> >      s.urdna(h), h(" "),
> >      p.urdna(h), h(" "),
> >      o.urdna(h), " .\n"
> >    );
> > }
> > 
> > Hash IRI.urdna (h: Hasher) {
> >    return h.add('<', value, '>');
> > }
> > 
> > Hash BNode.urdna (h: Hasher) {
> >    static int ordinal = 1;
> >    return h.add("_:c14n_" + ordinal++);
> > }
> > ]]
> > 
> > Literal elided, and you'd want conditional code around Quad.g, but you
> > get the idea. Note that this approach means you never need to
> > duplicate your graph in memory as you do when constructing a
> > potentially giant N-Quads string.
> 
> If I'm understanding your intent, you are essentially showing that the hash
> can be computed on the fly as a serialization is produced piecemeal, without
> ever materializing the whole serialization at once.

That's correct. Such an optimization would allow you to operate in
considerably less memory (and exemplifies Aidan's point that this is
an operation over an abstract RDF graph).


> That could be done, but that is exactly what I would advise AGAINST doing,
> because in essence you are defining a home grown hashing algorithm, even if
> your algorithm makes use of an off-the-shelf hashing algorithm as a
> component.  From a security perspective, I think it is unwise to do that.

But it's NOT a home-grown hashing algorithm; it's URDNA (with BNOde
ordering oversimplified for brevity). I was just pointing out that
URDNA starts with an abstract RDF graph and produces a consistent
ordering and BNode labeling, and finally, a hash.


> > > From a geek perspective, this abstract hashing function looks a little
> > arbitrary 'cause you could get just as strong a hash if you eliminated
> > all of the string constants (" ", "<", ">", ...). But keeping them is
> > cheap and allows you to serialize the result in a widely-supported RDF
> > syntax. RDF C14N, XML C14N and JWS's C14N all canonicalize to
> > serializable data, which makese it easier to implement and makes
> > user's lives way easier 'cause they can printf the input to the
> > hashing function.
> > 
> > I believe that describing hashing as a function over N-Quads suggests
> > that folks need to serialize as N-Quads to use it. I think the wording
> > describing it as function over an (abstract) RDF Graph (Dataset,
> > whatever) to say exactly what I as an RDF user want to know.
> 
> I appreciate that desire, but I think defining the hash over the abstract
> RDF would raise security concerns that would be better avoided, because it
> inherently entangles the hashing algorithm with RDF-specific code.  I think
> a safer approach is to use an off-the-shelf cryptographic hashing algorithm
> on the fully materialized bytes of canonical N-Quads, which is essentially
> what the charter currently describes.

In the email that started this sub-sub-sub-thread, you said:

|| But I think at least one part of the confusion and concern is that
|| the proposed canonicalization is framed as an *abstract* RDF Dataset
|| canonicalization."

I think that means we're wordsmithing the description of RDF Dataset
Hash [artfully bulleted by myself]:
[[
This specification details the processing steps of a hash function for
an arbitrary RDF Dataset. These step include

(1) the generation of a canonical form using the algorithm specified
    in the “RDF Dataset Canonicalization” deliverable,

(2) sorting the N-Quads serialization of the canonical form generated
    in the previous step, and

(3) application of a cryptographic hash function on the sorted N-Quads
    representation.
]]

I think this text already says what you want about N-Quads. Can you
say what you'd want to change?


BTW, you referenced security concerns above. I order to make this more
concrete, I suggest that those security concerns would be:

SC1. (utility) does it get the same hash for any isomorphism of an RDF
     graph (regardless of triple/quad order)? (yes)

SC2. (safety) does every pair of non-isomorphic graphs give distinct
     hashes? (yes)

SC3. (safety) can one combine the hashes of subgraphs to force a
     specific hash value for a graph? (no)

I think that bullet (3) above answer's SC3 without changes to the
charter. SC1 and SC2 require a deep dive into RDC.


> Thanks,
> David Booth
> 
> > 
> > 
> > > > On 2021-06-11 17:14, David Booth wrote
> > > > > 2. It is misleading, because *any* RDF canonicalization algorithm is
> > > > > fundamentally about serialization -- *not* the abstract RDF Dataset. The
> > > > > proposed algorithm is really only abstract in the sense that it can be
> > > > > used with a family of serializations.
> > 
> > I think the above shows that it's not fundamentally about a
> > serialization, more "compatible with" serialization. It doesn't
> > undermine your diff/patch utility below, but it does convey that it's
> > a function over a Dataset, not a string of unicode characters.
> > 
> > 
> > > > 2) It is true that the RDF standard states that "Blank node identifiers
> > > > are not part of the RDF abstract syntax, but are entirely dependent on
> > > > the concrete syntax or implementation." The standard also states that
> > > > "the set of possible blank nodes is arbitrary", and to have a set of
> > > > elements, we must be able to establish equality and inequality over
> > > > those elements; otherwise the set is not defined. So how can we deal
> > > > computationally with an arbitrary set of elements? We label blank nodes
> > > > in a one-to-one manner as a proxy and work with the labels. (We could,
> > > > equivalently, consider the set of labels to *be* the "arbitrary set of
> > > > blank nodes" mentioned by the standard.)
> > > 
> > > That's an interesting point!  You're right.  It hadn't occurred to me that
> > > the labels themselves could be considered the "arbitrary" set of blank
> > > nodes.
> > > 
> > > But again, I see that as a purely theoretical point, because from a
> > > practical perspective, in order to do anything useful with a canonicalized
> > > RDF dataset -- such as storing it, transmitting it, comparing it to another
> > > canonicalized RDF dataset, or computing its hash -- then you either need to
> > > serialize it, or you need to create RDF-specific tools for those operations,
> > > which would pretty much defeat the purpose of doing the canonicalization in
> > > the first place.  After all, canonicalization is not an end goal, it is
> > > merely a means to enable other operations to be performed more easily or
> > > without requiring RDF-specific tools.
> > > 
> > > For example, to my mind a key purpose in standardizing a canonical form of
> > > N-Quads (aside from digital signatures) is *specifically* to enable
> > > *standard* diff and patch tools to work on RDF.
> > > 
> > > Thanks,
> > > David Booth
> > > 
> > > > If the labels of two blank nodes are equal, the blank nodes are equal.
> > > > If the labels are not equal, the blank nodes are not equal. We can use
> > > > string libraries for this. Almost every implementation that works with
> > > > (abstract) RDF datasets does this; it's not something particular to
> > > > canonicalisation. So this is part of an implementation of the algorithm
> > > > and is completely compatible with what the standard suggests of abstract
> > > > RDF datasets, and those implementations that work with them.†
> > > > 
> > > > 
> > > > 
> > > > 
> > > > That said, I agree with you that the explainer could be made more
> > > > precise in this manner (without affecting readability) in that it does
> > > > refer to blank node labels as part of the dataset rather than the
> > > > implementation. Here are the problematic sentences I can see:
> > > > 
> > > > * "Two isomorphic RDF Datasets are identical except for differences in
> > > > how their individual blank nodes are labeled." This could be removed to
> > > > just leave the sentence that follows it: "In particular, R is isomorphic
> > > > with S if and only if it is possible to map the blank nodes of R to the
> > > > blank nodes of S in a one-to-one manner, generating an RDF dataset R'
> > > > such that R' = S." (This itself could be more formal, but I think that
> > > > would be out of scope for the explainer; also it is defined elsewhere.)
> > > > 
> > > > * "Such a canonicalization function can be implemented, in practice, as
> > > > a procedure that deterministically re-labels all blank nodes of an RDF
> > > > Dataset in a one-to-one manner" We could change "re-labels" to simply
> > > > "labels" to avoid the impression that the (abstract) RDF Dataset already
> > > > has labels.
> > > > 
> > > > * "without depending on the particular set of blank node labels used in
> > > > the input RDF Dataset" We could change to "without depending on the
> > > > particular set of blank node identifiers used in the serialization of
> > > > the input RDF Dataset"
> > > > 
> > > > * "It could also be referred to as a “canonical relabelling scheme”"
> > > > This could rather be "It could also be referred to as a “canonical
> > > > labeling scheme”", to avoid implying that labels already exist.
> > > > 
> > > > I have created a PR for this:
> > > >       https://github.com/w3c/lds-wg-charter/pull/91
> > > > 
> > > > If there's some other sentence that you see as imprecise in that way,
> > > > let me/us know!
> > > > 
> > > > Best,
> > > > Aidan
> > > > 
> > > > † Well, there is a perhaps one weird issue that if "the set of possible
> > > > blank nodes is arbitrary", then it could also be a uncountably infinite
> > > > set that cannot be "labelled" one-to-one by finite strings. The set of
> > > > blank nodes could be the set of reals, for example. So when we talk
> > > > about labelling blank nodes, we assume that they are countable (finite
> > > > or countably infinite). I am not aware that anything in the RDF standard
> > > > restricts the set of blank nodes to be finite, countably infinite, or
> > > > something else. Just to clarify that if this does turn out to be a
> > > > problem, it will likely be a theoretical flaw for most standards built
> > > > on top of RDF, and probably for the semantics of RDF graphs. I don't
> > > > know if this actually matters in any meaningful sense, but I think it
> > > > would be slightly comforting to know that somewhere, something normative
> > > > says or implies that the set of blank nodes is countable. :)
> > > > 
> > > 
> > 
>
Received on Sunday, 13 June 2021 09:00:06 UTC