Re: Signing and Verifying RDF Datasets for Dummies (like Me!)

Peter,

I have tried to implement your canonicalization algorithm as SPARQL:

# SELECT  (GROUP_CONCAT(?quadStr ; separator='\n') AS ?nQuads)
SELECT  (SHA1(GROUP_CONCAT(?quadStr ; separator=' \n')) AS ?hash)
WHERE
  { { SELECT DISTINCT  ?g ?s ?p ?o
      WHERE
        {   { ?s  ?p  ?o }
          UNION
            { GRAPH ?g
                { ?s  ?p  ?o }
            }
        }
    }
    BIND(concat("<", str(?s), ">") AS ?sStr)
    BIND(concat("<", str(?p), ">") AS ?pStr)
    BIND(if(isURI(?o),
            concat("<", str(?o), ">"),
            concat("\"", str(?o), "\"",
                   if(( lang(?o) != "" ),
                      concat("@", str(lang(?o))),
                      concat("^^<", str(datatype(?o)), ">"))))
         AS ?oStr)
    BIND(concat(?sStr, " ", ?pStr, " ", ?oStr, " ",
                if(bound(?g), concat("<", str(?g), ">", " "), ""),
                ".")
         AS ?quadStr)
  }
ORDER BY ?g ?s ?p ?o datatype(?o) lcase(lang(?o))

Blank nodes are ignored.

It works to the extent that:
* I computed ?hash for an N-Quads test file
* I round-tripped the test file as ?nQuads (using the commented-out
SELECT)
* I computed ?hash for the round-tripped N-Quads (this part requires
pulling them out of a SPARQL results syntax such as XML or CSV)
* both ?hash values matched, using Jena
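
For cross-checking outside SPARQL, the same idea (de-duplicate the
N-Quads lines, sort them, join with newlines, SHA-1 the UTF-8 bytes)
can be sketched in Python. This is only a sketch under assumptions: the
input has no blank nodes, the function name hash_nquads is made up for
illustration, and the sort is plain code-point order on whole lines, so
the digest will not necessarily equal the ?hash produced by the query.
Sorting explicitly also avoids relying on the order in which
GROUP_CONCAT sees its rows, which as far as I can tell SPARQL 1.1 leaves
unspecified.

```python
import hashlib


def hash_nquads(nquads_text: str) -> str:
    """Return the SHA-1 hex digest of the sorted, de-duplicated N-Quads lines.

    Assumption: the dataset contains no blank nodes, so sorting the
    serialized lines is enough to get a stable form.
    """
    # Split on "\n" only: N-Quads statements are newline-terminated, and
    # str.splitlines() would also split on characters such as U+2028 that
    # may legally appear unescaped inside a literal.
    lines = {line.strip() for line in nquads_text.split("\n") if line.strip()}
    canonical = "\n".join(sorted(lines))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()
```

Two files holding the same quads in a different order, or with
duplicated lines, should then give the same digest.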

One thing this query fails on is serializing literal values that
contain newlines: a raw newline is not allowed inside an
N-Triples/N-Quads literal and has to be escaped. Can anyone suggest
what should replace str(?o) to fix that?
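
For reference, the escaping that N-Quads expects for a literal's lexical
form can be sketched in Python as below. The function name
escape_literal is hypothetical, and the sketch covers only the four
characters that cannot appear raw inside a quoted N-Quads literal
(backslash, double quote, line feed, carriage return).

```python
def escape_literal(lexical: str) -> str:
    """Escape a literal's lexical form for use between N-Quads double quotes."""
    return (
        lexical
        # Backslash must go first, otherwise the backslashes introduced by
        # the later replacements would themselves be escaped again.
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
```

In the query itself, the equivalent would presumably be a chain of
nested REPLACE() calls wrapped around str(?o), with the backslash
replacement innermost; I have not tested that form.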


Martynas
atomgraph.com

On Mon, Jun 7, 2021 at 9:46 PM Peter Patel-Schneider
<pfpschneider@gmail.com> wrote:
>
> Here's my version of "Signing and Verifying RDF Datasets for Dummies".
>
>
> If you want to sign and verify documents (sequences of Unicode code
> points), encode the document in utf-8 and sign and verify a hash of the
> octet sequence.  Transmit the octet sequence along with the signed
> hash.
>
> If you want to sign and verify RDF datasets, serialize the dataset in
> N-Quads and sign and verify that document.  When a receiver
> deserializes the document the result will be isomorphic to the dataset
> that the sender had.   Don't use a syntax that allows relative IRIs
> (e.g., Turtle) as relative IRIs may turn into different absolute IRIs
> when the document is deserialized.  Don't use a syntax that allows
> remote resources to affect deserialization (e.g., JSON-LD) as these
> remote resources can be modified by an attacker.  Don't use a syntax
> where parts of the document that don't serialize parts of the dataset
> look as if they might be important (e.g., RDFa) as receivers might come
> to depend on these non-coding parts.  Don't use a syntax where it is
> not obvious which parts of the document serialize parts of the dataset
> (e.g., JSON-LD) as receivers might be confused as to just what dataset
> is being transmitted.  Don't use a syntax where the mapping from the
> serialization to the dataset is poorly defined in practice (e.g.,
> JSON-LD).
>
> If you want to sign and verify RDF datasets and you want isomorphic RDF
> datasets to have the same signature, you first need to define a
> canonical serialization for RDF datasets so that isomorphic RDF
> datasets have the same canonical form.  To sign and verify, create the
> canonical serialization for the RDF dataset and sign and verify that.
> Use N-Quads for this canonical form for the reasons above.  Don't
> transmit any encoding other than the N-Quads canonical form for the
> reasons above, and more.  If you don't want to depend on a complex
> algorithm to produce the canonical form then forbid blank nodes.
>
> This pretty much boils down to using, and only transmitting, the
> simplest and most transparent document format possible, because
> anything else just adds extra problems, and N-Quads is the simplest and
> most transparent document format for RDF datasets.
>
>
> My takeaway from this is that any W3C WG that is trying to standardize
> something that involves signing and verifying RDF datasets should only
> use N-Quads to transmit these datasets.
>
>
> peter
>
>
>

Received on Tuesday, 8 June 2021 10:17:07 UTC