W3C home > Mailing lists > Public > semantic-web@w3.org > June 2021

Re: Thoughts on the LDS WG chartering discussion

From: David Booth <david@dbooth.org>
Date: Mon, 14 Jun 2021 00:33:05 -0400
To: semantic-web@w3.org
Message-ID: <54a5ee55-887a-e594-84f7-ad010bcb74d3@dbooth.org>
Hi Eric,

On 6/13/21 4:59 AM, Eric Prud'hommeaux wrote:
> On Sat, Jun 12, 2021 at 10:10:35PM -0400, David Booth wrote:
>> On 6/12/21 11:34 AM, Eric Prud'hommeaux wrote:
>>> . . .
>>> One could implement URDNA without ever serializing N-Quads with a
>>> hashCode() function on Quad, BNode, IRI and Literal: . . .
>>> [[
>>> Hash Quad.urdna (h: Hasher) {
>>>     return h.add(
>>>       g.urdna(h), h(" "),
>>>       s.urdna(h), h(" "),
>>>       p.urdna(h), h(" "),
>>>       o.urdna(h), " .\n"
>>>     );
>>> }
>>>
>>> Hash IRI.urdna (h: Hasher) {
>>>     return h.add('<', value, '>');
>>> }
>>>
>>> Hash BNode.urdna (h: Hasher) {
>>>     static int ordinal = 1;
>>>     return h.add("_:c14n_" + ordinal++);
>>> }
>>> ]]
>>>
>>> Literal elided, and you'd want conditional code around Quad.g, but you
>>> get the idea. Note that this approach means you never need to
>>> duplicate your graph in memory as you do when constructing a
>>> potentially giant N-Quads string.
>>
>> If I'm understanding your intent, you are essentially showing that the hash
>> can be computed on the fly as a serialization is produced piecemeal, without
>> ever materializing the whole serialization at once.
> 
> That's correct. Such an optimization would allow you to operate in
> considerably less memory (and exemplifies Aidan's point that this is
> an operation over an abstract RDF graph).
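
(For concreteness: in hash-API terms, I take this piecemeal computation 
to mean streaming updates to a digest object instead of one-shot hashing 
of a materialized string. A quick sketch using Python's hashlib, with 
made-up quad fragments:)

```python
import hashlib

# "On the fly" hashing: feed the serialization to the digest piecemeal,
# never materializing the whole N-Quads string. The quad fragments here
# are made up for illustration.
pieces = ['_:c14n0 ', '<http://example.org/p> ', '"a" ', '.\n']

streamed = hashlib.sha256()
for piece in pieces:
    streamed.update(piece.encode("utf-8"))

# One-shot hashing of the fully materialized string gives the same digest.
materialized = hashlib.sha256("".join(pieces).encode("utf-8"))
print(streamed.hexdigest() == materialized.hexdigest())  # True
```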
> 
> 
>> That could be done, but that is exactly what I would advise AGAINST doing,
>> because in essence you are defining a home grown hashing algorithm, even if
>> your algorithm makes use of an off-the-shelf hashing algorithm as a
>> component.  From a security perspective, I think it is unwise to do that.
> 
> But it's NOT a home-grown hashing algorithm; it's URDNA (with BNode
> ordering oversimplified for brevity). I was just pointing out that
> URDNA starts with an abstract RDF graph and produces a consistent
> ordering and BNode labeling, and finally, a hash.

I'm confused by your terminology, because AFAIK URDNA2015 is only the 
canonicalization algorithm -- not the hashing algorithm.
https://json-ld.github.io/rdf-dataset-canonicalization/spec/index.html
I think you meant the "RDF Dataset Hash (RDH)" algorithm?
https://w3c.github.io/lds-wg-charter/index.html

But in any case, I disagree with your claim that the algorithm that you 
sketched above is not a home-grown hashing algorithm.  *YOU* wrote the 
algorithm above: it is *not* an off-the-shelf hashing algorithm, even if 
it makes use of URDNA2015 and an off-the-shelf hashing algorithm as a 
parameter/component.

Vulnerabilities can be introduced even in seemingly simple variations of 
an algorithm.  I'm not claiming that your particular algorithm *has* a 
vulnerability, but I *am* claiming that we don't really know whether or 
not it has a vulnerability, because it has not been highly scrutinized 
by the world's crypto experts, and I don't think W3C should get involved 
in that question.  That's why I think it is so important for this work to 
use an off-the-shelf hash algorithm to hash the fully materialized bytes 
of canonical N-Quads, which is essentially what the charter already 
proposes.  Ivan views the canonical N-Quads as being equivalent to the 
canonicalized abstract RDF dataset, and I agree in theory, but in 
practice I think the distinction is *very* important *because* of the 
importance of using an unmodified off-the-shelf hashing algorithm on the 
fully *materialized* bytes.
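
To make that concrete, the approach I'm advocating is simply this 
(a Python sketch; the canonical N-Quads lines are made up, and in 
practice they would be the output of URDNA2015):

```python
import hashlib

# Hypothetical canonical N-Quads (in practice these bytes would be the
# output of the URDNA2015 canonicalization step; the lines are made up).
canonical_nquads = (
    '_:c14n0 <http://example.org/p> "a" .\n'
    '_:c14n0 <http://example.org/q> _:c14n1 .\n'
)

# An unmodified, off-the-shelf hash applied to the fully materialized
# UTF-8 bytes -- no RDF-specific code inside the hashing step.
digest = hashlib.sha256(canonical_nquads.encode("utf-8")).hexdigest()
print(digest)
```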

> 
>>> From a geek perspective, this abstract hashing function looks a little
>>> arbitrary 'cause you could get just as strong a hash if you eliminated
>>> all of the string constants (" ", "<", ">", ...). But keeping them is
>>> cheap and allows you to serialize the result in a widely-supported RDF
>>> syntax. RDF C14N, XML C14N and JWS's C14N all canonicalize to
>>> serializable data, which makes it easier to implement and makes
>>> users' lives way easier 'cause they can printf the input to the
>>> hashing function.
>>>
>>> I believe that describing hashing as a function over N-Quads suggests
>>> that folks need to serialize as N-Quads to use it. I think the wording
>>> describing it as a function over an (abstract) RDF Graph (Dataset,
>>> whatever) says exactly what I as an RDF user want to know.
>>
>> I appreciate that desire, but I think defining the hash over the abstract
>> RDF would raise security concerns that would be better avoided, because it
>> inherently entangles the hashing algorithm with RDF-specific code.  I think
>> a safer approach is to use an off-the-shelf cryptographic hashing algorithm
>> on the fully materialized bytes of canonical N-Quads, which is essentially
>> what the charter currently describes.
> 
> In the email that started this sub-sub-sub-thread, you said:
> 
> || But I think at least one part of the confusion and concern is that
> || the proposed canonicalization is framed as an *abstract* RDF Dataset
> || canonicalization."
> 
> I think that means we're wordsmithing the description of RDF Dataset
> Hash [artfully bulleted by myself]:
> [[
> This specification details the processing steps of a hash function for
> an arbitrary RDF Dataset. These steps include
> 
> (1) the generation of a canonical form using the algorithm specified
>      in the “RDF Dataset Canonicalization” deliverable,
> 
> (2) sorting the N-Quads serialization of the canonical form generated
>      in the previous step, and
> 
> (3) application of a cryptographic hash function on the sorted N-Quads
>      representation.
> ]]
> I think this text already says what you want about N-Quads. Can you
> say what you'd want to change?

I basically agree with the existing algorithmic approach, *because* it 
takes the hash of the fully materialized N-Quads.  The changes I suggest 
are:

1. I think the N-Quads canonicalization algorithm needs to be a *named* 
deliverable (possibly under its own charter), so that it can be easily 
referenced and used for other purposes.

2. I think that the diff/patch use case should be included in the 
Explainer, and the proposed canonicalization algorithm should be tested 
and evaluated against the diff/patch use case, to see how well it 
performs.  I don't expect any canonicalization algorithm to be perfect 
for diff/patch.  If it is "good enough", then that's fine with me.  But 
if it performs poorly for common diff/patch use cases, then I think 
improvements should at least be seriously attempted before standardization.

3. I think the proposed hash should be characterized as being a hash of 
canonical N-Quads -- *not* a hash of the *abstract* RDF dataset.  I 
think it is slightly misleading (and causes confusion) to describe the 
hash as being a hash of the *abstract* RDF dataset.
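
Putting the charter's three steps into code-shaped form (a sketch; 
`canonical_nquads_lines` stands in for the output of the URDNA2015 
canonicalization deliverable):

```python
import hashlib

def rdf_dataset_hash(canonical_nquads_lines):
    """Charter steps (2) and (3): sort the canonical N-Quads lines, then
    hash the materialized bytes with a stock algorithm. Step (1), the
    URDNA2015 canonicalization, is assumed to have already produced
    canonical_nquads_lines."""
    sorted_doc = "".join(sorted(canonical_nquads_lines))
    return hashlib.sha256(sorted_doc.encode("utf-8")).hexdigest()

# Made-up canonical quads; the input order should not matter.
quads = [
    '_:c14n0 <http://example.org/q> _:c14n1 .\n',
    '_:c14n0 <http://example.org/p> "a" .\n',
]
print(rdf_dataset_hash(quads) == rdf_dataset_hash(list(reversed(quads))))  # True
```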

> 
> BTW, you referenced security concerns above. In order to make this more
> concrete, I suggest that those security concerns would be:
> 
> SC1. (utility) does it get the same hash for any isomorphism of an RDF
>       graph (regardless of triple/quad order)? (yes)
> 
> SC2. (safety) does every pair of non-isomorphic graphs give distinct
>       hashes? (yes)
> 
> SC3. (safety) can one combine the hashes of subgraphs to force a
>       specific hash value for a graph? (no)
> 
> I think that bullet (3) above answers SC3 without changes to the
> charter. SC1 and SC2 require a deep dive into RDC.

I don't believe that W3C should be doing work that raises those 
questions.  Those are questions that crypto experts should be asking 
about *generic* hash algorithms -- not RDF-specific algorithms.  That's 
why I think it is so important to cleanly separate the N-Quads 
canonicalization from the hash computation.
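
In code terms, the separation I have in mind makes the interface between 
the two steps plain bytes, so the only cryptographic component is a stock 
algorithm (a Python sketch; the function name is mine):

```python
import hashlib

def hash_canonical_nquads(nquads_bytes, algorithm="sha256"):
    """The interface between canonicalization and hashing is plain bytes,
    so the only cryptographic component is a stock hashlib algorithm."""
    h = hashlib.new(algorithm)
    h.update(nquads_bytes)
    return h.hexdigest()

# Made-up canonical N-Quads bytes.
doc = b'_:c14n0 <http://example.org/p> "a" .\n'
print(hash_canonical_nquads(doc))            # SHA-256 by default
print(hash_canonical_nquads(doc, "sha384"))  # swap algorithms, no RDF code touched
```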

Thanks,
David Booth
Received on Monday, 14 June 2021 04:33:50 UTC
