Graph terms, provenance and revision control

While the Provenance WG is working on a general description of provenance, I want to address specifically the provenance of RDF triples since the lack 
of a proper vocabulary has been a bit of a hindrance to the RDF WG.

I suggest that the provenance of RDF statements might be better understood in terms of revision control (RC). An RC vocabulary helps address 
several of the open issues relating to graphs, covers many of the use cases for graphs, and offers more familiar, commonly understood terminology 
than the g-* terms. A comparison of existing terms:

Concept                g-*     NGPT         SPARQL       RC                Suggested
serialized triples     g-text  document     results      file,patch,diff   graph literal
graph (immutable set)  g-snap  graph        graph        version           RDF graph
graph(s) w/identifier          NamedGraphs  Dataset      commit            assertion
mutable graph          g-box                Graph Store  branch,tag,label  repository

ISSUE-32 asks "Can we identify both g-boxes and g-snaps?" Yes! Using an RC vocabulary, RDF graphs, aka g-snaps, are like versions of a file. A 
serialized version is a graph literal. A mutable container (g-box) is like a branch: it represents a set of changes to a graph over time. An 
assertion of a graph or set of graphs is like a commit. Assertions provide the additional layer of indirection needed to describe provenance 
information and graph composition. A repository is the engine that provides access to graphs (assertions) and their metadata.
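
To make the mapping above a little more concrete, here is a minimal sketch in Python. Nothing in it is a proposed vocabulary: the class names 
(Assertion, Repository), the field names, and the "ex:" terms are all hypothetical and only illustrate how the pieces relate.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Iterable, Optional, Tuple

# A triple is modelled as three plain strings; a real store would use RDF terms.
Triple = Tuple[str, str, str]


@dataclass(frozen=True)
class Assertion:
    """Like a commit: an immutable RDF graph (g-snap) plus provenance metadata."""
    graph: FrozenSet[Triple]
    author: str
    parent: Optional["Assertion"] = None


@dataclass
class Repository:
    """Gives access to assertions; each branch is a mutable pointer (g-box)."""
    branches: Dict[str, Assertion] = field(default_factory=dict)

    def commit(self, branch: str, triples: Iterable[Triple], author: str) -> Assertion:
        # The new assertion records the branch's previous state as its parent,
        # then the branch pointer moves forward. The old assertion is untouched.
        parent = self.branches.get(branch)
        assertion = Assertion(frozenset(triples), author, parent)
        self.branches[branch] = assertion
        return assertion


repo = Repository()
first = repo.commit("main", {("ex:a", "ex:knows", "ex:b")}, "alice")
second = repo.commit(
    "main",
    {("ex:a", "ex:knows", "ex:b"), ("ex:b", "ex:knows", "ex:c")},
    "bob",
)
```

The branch changes over time, but each assertion it has pointed to stays fixed, which is the g-box/g-snap distinction.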

The desired behavior seems very similar to that provided by Git. Git manages immutable sets over time and provides provenance and time metadata. 
Similarly, a repository (aka store) should provide access to assertions and metadata. Because assertions are local to the repository and immutable, 
like commits, it's easy to assign a unique identifier to them. Graphs no longer need to be directly addressed, which removes the need to give names 
to graphs that don't have any intrinsic name to begin with. Instead, you refer to an assertion. This has the added benefit that you always know what 
set of triples you are working with, just as when you specify a commit. A local identifier similar to HEAD can also be used to provide access to the 
current state of a particular branch, providing the functionality of a mutable container.
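
As a rough illustration of identifiers and HEAD, the sketch below derives an assertion's identifier by hashing its content, which is how Git 
names commits. Hashing is my assumption here; a repository could just as well mint identifiers some other way. The names Store, assertion_id 
and resolve are hypothetical, and the triples are assumed to contain no blank nodes (those would need canonicalization first).

```python
import hashlib


def assertion_id(triples, author, parent_id=""):
    """Derive a stable identifier from an assertion's content and metadata."""
    # Sorting makes the identifier independent of the order triples arrive in.
    lines = sorted(" ".join(t) + " ." for t in triples)
    payload = "\n".join([f"author {author}", f"parent {parent_id}"] + lines)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class Store:
    def __init__(self):
        self.assertions = {}  # id -> (frozenset of triples, author, parent id)
        self.refs = {}        # branch name -> id of its latest assertion
        self.head = "main"    # the branch that "HEAD" currently follows

    def commit(self, triples, author):
        parent = self.refs.get(self.head, "")
        aid = assertion_id(triples, author, parent)
        self.assertions[aid] = (frozenset(triples), author, parent)
        self.refs[self.head] = aid  # only the branch pointer is mutable
        return aid

    def resolve(self, name="HEAD"):
        """Return the triples for "HEAD", a branch name, or an assertion id."""
        if name == "HEAD":
            name = self.refs[self.head]
        elif name in self.refs:
            name = self.refs[name]
        return self.assertions[name][0]


store = Store()
a = store.commit({("ex:a", "ex:p", "ex:b")}, "alice")
b = store.commit({("ex:a", "ex:p", "ex:b"), ("ex:a", "ex:p", "ex:c")}, "bob")
```

Resolving an assertion identifier always yields the same triples, while resolving HEAD yields whatever the branch currently points to.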

Naming graphs has been a hot topic: does a URI name a graph, or simply label it, as is the case with SPARQL Datasets? An assertion is just 
like any other resource and has a URI that uniquely identifies it. It is the responsibility of the repository to provide the mapping ('blessing') 
between the assertion and the triples themselves. Just as with Git, repositories can be cloned, and that association between URI and triples can be 
moved between repositories.
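
A small sketch of the 'blessing' idea, again in Python and again purely illustrative: the repository owns the mapping from assertion URI to 
triples, and cloning copies that mapping. The method names (bless, lookup, clone) and the urn:example: URIs are hypothetical.

```python
class Repository:
    def __init__(self):
        # The 'blessing': assertion URI -> immutable set of triples.
        self._triples_by_uri = {}

    def bless(self, uri, triples):
        # Assertions are immutable, so a URI may be bound only once.
        if uri in self._triples_by_uri:
            raise ValueError(f"{uri} is already bound to a set of triples")
        self._triples_by_uri[uri] = frozenset(triples)

    def lookup(self, uri):
        return self._triples_by_uri[uri]

    def clone(self):
        # Copy the mapping; the frozensets themselves can be shared safely.
        other = Repository()
        other._triples_by_uri = dict(self._triples_by_uri)
        return other


origin = Repository()
origin.bless("urn:example:assertion:1", {("ex:a", "ex:p", "ex:b")})

mirror = origin.clone()
mirror.bless("urn:example:assertion:2", {("ex:c", "ex:p", "ex:d")})
```

After cloning, both repositories agree on what the first URI denotes, while later assertions made in the clone stay local to it until they are moved.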

I'm not suggesting any specific vocabulary, nor do I think the RDF WG should recommend one. I think that just as with RCS, repositories should be free 
to define their own RC schemas. But identifying the graph use cases as part of the RC domain will, I think, help bring some clarity to the 
RDF WG discussions and lead to the development of specific RC schemas.

I've tried to be brief, so I left a lot of specifics and related discussions/tangents out. I can expand on these ideas more, but hopefully this is 
enough to get the rough idea across. From what I've seen of the recent discussions, some convergence seems to be starting here. I 
look forward to your feedback.


-m

Received on Wednesday, 12 October 2011 16:11:46 UTC