- From: David Booth <david@dbooth.org>
- Date: Sat, 12 Jun 2021 22:10:35 -0400
- To: semantic-web@w3.org
On 6/12/21 11:34 AM, Eric Prud'hommeaux wrote: > . . . > One could implement URDNA without ever serializing N-Quads with a > hashCode() function on Quad, BNode, IRI and Literal: . . . > > [[ > Hash Quad.urdna (h: Hasher) { > return h.add( > g.urdna(h), h(" "), > s.urdna(h), h(" "), > p.urdna(h), h(" "), > o.urdna(h), " .\n" > ); > } > > Hash IRI.urdna (h: Hasher) { > return h.add('<', value, '>'); > } > > Hash BNode.urdna (h: Hasher) { > static int ordinal = 1; > return h.add("_:c14n_" + ordinal++); > } > ]] > > Literal elided, and you'd want conditional code around Quad.g, but you > get the idea. Note that this approach means you never need to > duplicate your graph in memory as you do when constructing a > potentially giant N-Quads string. If I'm understanding your intent, you are essentially showing that the hash can be computed on the fly as a serialization is produced piecemeal, without ever materializing the whole serialization at once. That could be done, but that is exactly what I would advise AGAINST doing, because in essence you are defining a home grown hashing algorithm, even if your algorithm makes use of an off-the-shelf hashing algorithm as a component. From a security perspective, I think it is unwise to do that. > >>From a geek perspective, this abstract hashing function looks a little > arbitrary 'cause you could get just as strong a hash if you eliminated > all of the string constants (" ", "<", ">", ...). But keeping them is > cheap and allows you to serialize the result in a widely-supported RDF > syntax. RDF C14N, XML C14N and JWS's C14N all canonicalize to > serializable data, which makese it easier to implement and makes > user's lives way easier 'cause they can printf the input to the > hashing function. > > I believe that describing hashing as a function over N-Quads suggests > that folks need to serialize as N-Quads to use it. I think the wording > describing it as function over an (abstract) RDF Graph (Dataset, > whatever) to say exactly what I as an RDF user want to know. I appreciate that desire, but I think defining the hash over the abstract RDF would raise security concerns that would be better avoided, because it inherently entangles the hashing algorithm with RDF-specific code. I think a safer approach is to use an off-the-shelf cryptographic hashing algorithm on the fully materialized bytes of canonical N-Quads, which is essentially what the charter currently describes. Thanks, David Booth > > >>> On 2021-06-11 17:14, David Booth wrote >>>> 2. It is misleading, because *any* RDF canonicalization algorithm is >>>> fundamentally about serialization -- *not* the abstract RDF Dataset. The >>>> proposed algorithm is really only abstract in the sense that it can be >>>> used with a family of serializations. > > I think the above shows that it's not fundamentally about a > serialization, more "compatible with" serialization. It doesn't > undermine your diff/patch utility below, but it does convey that it's > a function over a Dataset, not a string of unicode characters. > > >>> 2) It is true that the RDF standard states that "Blank node identifiers >>> are not part of the RDF abstract syntax, but are entirely dependent on >>> the concrete syntax or implementation." The standard also states that >>> "the set of possible blank nodes is arbitrary", and to have a set of >>> elements, we must be able to establish equality and inequality over >>> those elements; otherwise the set is not defined. So how can we deal >>> computationally with an arbitrary set of elements? We label blank nodes >>> in a one-to-one manner as a proxy and work with the labels. (We could, >>> equivalently, consider the set of labels to *be* the "arbitrary set of >>> blank nodes" mentioned by the standard.) >> >> That's an interesting point! You're right. It hadn't occurred to me that >> the labels themselves could be considered the "arbitrary" set of blank >> nodes. >> >> But again, I see that as a purely theoretical point, because from a >> practical perspective, in order to do anything useful with a canonicalized >> RDF dataset -- such as storing it, transmitting it, comparing it to another >> canonicalized RDF dataset, or computing its hash -- then you either need to >> serialize it, or you need to create RDF-specific tools for those operations, >> which would pretty much defeat the purpose of doing the canonicalization in >> the first place. After all, canonicalization is not an end goal, it is >> merely a means to enable other operations to be performed more easily or >> without requiring RDF-specific tools. >> >> For example, to my mind a key purpose in standardizing a canonical form of >> N-Quads (aside from digital signatures) is *specifically* to enable >> *standard* diff and patch tools to work on RDF. >> >> Thanks, >> David Booth >> >>> If the labels of two blank nodes are equal, the blank nodes are equal. >>> If the labels are not equal, the blank nodes are not equal. We can use >>> string libraries for this. Almost every implementation that works with >>> (abstract) RDF datasets does this; it's not something particular to >>> canonicalisation. So this is part of an implementation of the algorithm >>> and is completely compatible with what the standard suggests of abstract >>> RDF datasets, and those implementations that work with them.† >>> >>> >>> >>> >>> That said, I agree with you that the explainer could be made more >>> precise in this manner (without affecting readability) in that it does >>> refer to blank node labels as part of the dataset rather than the >>> implementation. Here are the problematic sentences I can see: >>> >>> * "Two isomorphic RDF Datasets are identical except for differences in >>> how their individual blank nodes are labeled." This could be removed to >>> just leave the sentence that follows it: "In particular, R is isomorphic >>> with S if and only if it is possible to map the blank nodes of R to the >>> blank nodes of S in a one-to-one manner, generating an RDF dataset R' >>> such that R' = S." (This itself could be more formal, but I think that >>> would be out of scope for the explainer; also it is defined elsewhere.) >>> >>> * "Such a canonicalization function can be implemented, in practice, as >>> a procedure that deterministically re-labels all blank nodes of an RDF >>> Dataset in a one-to-one manner" We could change "re-labels" to simply >>> "labels" to avoid the impression that the (abstract) RDF Dataset already >>> has labels. >>> >>> * "without depending on the particular set of blank node labels used in >>> the input RDF Dataset" We could change to "without depending on the >>> particular set of blank node identifiers used in the serialization of >>> the input RDF Dataset" >>> >>> * "It could also be referred to as a “canonical relabelling scheme”" >>> This could rather be "It could also be referred to as a “canonical >>> labeling scheme”", to avoid implying that labels already exist. >>> >>> I have created a PR for this: >>> https://github.com/w3c/lds-wg-charter/pull/91 >>> >>> If there's some other sentence that you see as imprecise in that way, >>> let me/us know! >>> >>> Best, >>> Aidan >>> >>> † Well, there is a perhaps one weird issue that if "the set of possible >>> blank nodes is arbitrary", then it could also be a uncountably infinite >>> set that cannot be "labelled" one-to-one by finite strings. The set of >>> blank nodes could be the set of reals, for example. So when we talk >>> about labelling blank nodes, we assume that they are countable (finite >>> or countably infinite). I am not aware that anything in the RDF standard >>> restricts the set of blank nodes to be finite, countably infinite, or >>> something else. Just to clarify that if this does turn out to be a >>> problem, it will likely be a theoretical flaw for most standards built >>> on top of RDF, and probably for the semantics of RDF graphs. I don't >>> know if this actually matters in any meaningful sense, but I think it >>> would be slightly comforting to know that somewhere, something normative >>> says or implies that the set of blank nodes is countable. :) >>> >> >
Received on Sunday, 13 June 2021 02:11:12 UTC