- From: Gavin Carothers <gavin@topquadrant.com>
- Date: Thu, 23 Jun 2011 16:39:30 -0700
- To: RDF-WG WG <public-rdf-wg@w3.org>
One of the commonly mentioned uses cases for named graphs and graph literals is their use with digital signatures. At the moment signing an RDF graph is impossible. Today applications and systems can sign serializations of an RDF graph. However, there are issues with all of those serializations, and the whole notion of signing the serialized form. For example as we know a given RDF graph can be serialized into a huge number of equivalent RDF/XML files. We also know that RDF/XML can't represent all RDF graphs. If we leave the creation of a canonical form up to each serialization we are likely to end up with a large number of canonicalization methods. One for RDF/XML, one for Turtle, one for JSON-LD, one for RDF-JSON, etc. If we look at RDFa, the situation gets really strange. What the heck do you sign? The whole HTML, what if you just want to sign the data? It seems that creating a standard canonical byte stream based on an abstract syntax graph would be very useful. It also seems reasonably easy. The simplest naive approach of sorting N-Triples works reasonably well. Until that is you encounter our good old friend Blank Nodes. Happily the majority of the work needed to address blank nodes can be found in Jeremy Carroll's paper "Signing RDF Graphs" http://www.hpl.hp.com/techreports/2003/HPL-2003-142.html from back in 2003. What follows is a simple summary of the method in "Signing RDF Graphs 2003". Please read the paper for all the gory details. Basically, we sort the triples in lexicographic order, ignoring blank nodes. There are some remaining issues with different triples that are identical when ignoring blank nodes. It is usually possible to number the blank nodes in document order (in the output document), to break these ties. There are cases for which it is not possible to make such a tie-breaking numbering. These cases can be resolved before saving the document by modifying the graph by explicitly and nondeterministically adding formally meaningless triples. Output it all as N-Triples. This process is always O(n log (n)), there are also a number of possible optimizations mentioned in the paper. The nondeterministic nature of the process isn't great. Happily lots and lots of real world RDF graphs will never get to the nondeterministic step. In fact it would be perfectly reasonable for a system or application present giant red flashing lights when publishing such data. There is also a bit of an issue in that at the moment the paper defines the canonical graph as after deleting all non-distinctive triples the first occurrence of each blank node identifier appear in numeric order starting at 1 without gaps. It is possible to make sure that the blank node labels stay consistent as well. There are other non-signing uses. By using a line oriented canonical form existing version control and differencing tools will work right out of the box. A normal unified diff is suddenly useful not just for code, but for data. Welcome further discussion and feedback, specifically from people looking at implementing canonicalization in a specific serialization, and if this can meet your use cases. Cheers, Gavin PS. House keeping, the addition of a semantics free predicate may mean reopening ISSUE-56.
Received on Thursday, 23 June 2011 23:39:59 UTC