Need for a canonical byte stream for an RDF graph from Gavin Carothers on 2011-06-23 (public-rdf-wg@w3.org from June 2011)

From: Gavin Carothers <gavin@topquadrant.com>
Date: Thu, 23 Jun 2011 16:39:30 -0700
To: RDF-WG WG <public-rdf-wg@w3.org>
Message-ID: <BANLkTikTyk_87ORYYSZjfWfPapVz8VhMDg@mail.gmail.com>
One of the commonly mentioned uses cases for named graphs and graph
literals is their use with digital signatures. At the moment signing
an RDF graph is impossible. Today applications and systems can sign
serializations of an RDF graph. However, there are issues with all of
those serializations, and the whole notion of signing the serialized
form. For example as we know a given RDF graph can be serialized into
a huge number of equivalent RDF/XML files. We also know that RDF/XML
can't represent all RDF graphs. If we leave the creation of a
canonical form up to each serialization we are likely to end up with a
large number of canonicalization methods. One for RDF/XML, one for
Turtle, one for JSON-LD, one for RDF-JSON, etc. If we look at RDFa,
the situation gets really strange. What the heck do you sign? The
whole HTML, what if you just want to sign the data?

It seems that creating a standard canonical byte stream based on an
abstract syntax graph would be very useful. It also seems reasonably
easy. The simplest naive approach of sorting N-Triples works
reasonably well. Until that is you encounter our good old friend Blank
Nodes. Happily the majority of the work needed to address blank nodes
can be found in Jeremy Carroll's paper "Signing RDF Graphs"
http://www.hpl.hp.com/techreports/2003/HPL-2003-142.html from back in
2003.

What follows is a simple summary of the method in "Signing RDF Graphs
2003". Please read the paper for all the gory details. Basically, we
sort the triples in lexicographic order, ignoring blank nodes. There
are some remaining issues with different triples that are identical
when ignoring blank nodes. It is usually possible to number the blank
nodes in document order (in the output document), to break these ties.
There are cases for which it is not possible to make such a
tie-breaking numbering. These cases can be resolved before saving the
document by modifying the graph by explicitly and nondeterministically
adding formally meaningless triples. Output it all as N-Triples. This
process is always O(n log (n)), there are also a number of possible
optimizations mentioned in the paper.

The nondeterministic nature of the process isn't great. Happily lots
and lots of real world RDF graphs will never get to the
nondeterministic step. In fact it would be perfectly reasonable for a
system or application present giant red flashing lights when
publishing such data.

There is also a bit of an issue in that at the moment the paper
defines the canonical graph as after deleting all non-distinctive
triples the first occurrence of each blank node identifier appear in
numeric order starting at 1 without gaps. It is possible to make sure
that the blank node labels stay consistent as well.

There are other non-signing uses. By using a line oriented canonical
form existing version control and differencing tools will work right
out of the box. A normal unified diff is suddenly useful not just for
code, but for data.

Welcome further discussion and feedback, specifically from people
looking at implementing canonicalization in a specific serialization,
and if this can meet your use cases.

Cheers,
Gavin

PS. House keeping, the addition of a semantics free predicate may mean
reopening ISSUE-56.
Received on Thursday, 23 June 2011 23:39:59 UTC