Graph terms, provenance and revision control from michael on 2011-10-12 (public-rdf-comments@w3.org from October 2011)

From: michael <michael@thinknasium.org>
Date: Wed, 12 Oct 2011 07:21:29 -0700
To: public-rdf-comments@w3.org
Message-ID: <4E95A269.4090903@thinknasium.org>

While the Provenance WG is working on a general description of provenance, I want to address specifically the provenance of RDF triples since the lack
of a proper vocabulary has been a bit of a hindrance to the RDF WG.

I suggest that the provenance of RDF statements might be better understood in terms of revision control (RC). An RC vocabulary helps address several
of the open issues relating to graphs, covers many of the use cases for graphs, and offers a more familiar, commonly understood terminology for the
g-* terms. A comparison of the existing terms:

Concept                 g-*      NGPT         SPARQL       RC                  Suggested
serialized triples      g-text   document     results      file, patch, diff   graph literal
graph (immutable set)   g-snap   graph        graph        version             RDF graph
graph(s) w/identifier   -        NamedGraphs  Dataset      commit              assertion
mutable graph           g-box    -            Graph Store  branch, tag, label  repository

ISSUE-32 asks, "Can we identify both g-boxes and g-snaps?" Yes! Using an RC vocabulary, RDF graphs (aka g-snaps) are like versions of a file. A
serialized version is a graph literal. A mutable container (g-box) is like a branch: it represents a set of changes to a graph over time. An assertion
of a graph or set of graphs is like a commit. Assertions provide the additional layer of indirection needed to describe provenance information and
graph composition. A repository is the engine that provides access to graphs (assertions) and their metadata.

The desired behavior seems very similar to that provided by Git. Git manages immutable sets over time and provides provenance and time metadata.
Similarly, a repository (aka store) should provide access to assertions and metadata. Because assertions are local to the repository and immutable,
like a commit, it's easy to assign a unique identifier to them. Graphs no longer need to be directly addressed, which removes the need to give names
to graphs that don't have any intrinsic name to begin with. Instead, you refer to an assertion. This has the added benefit that you always know what
set of triples you are working with, just as when specifying a commit. A local identifier similar to HEAD can also be used to provide access to the
current state of a particular branch, providing the functionality of a mutable container.
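
As a rough illustration of the paragraph above, here is a minimal Python sketch of a Git-like repository of assertions. Everything in it is an
assumption made for illustration and not a proposed vocabulary or API: the names Assertion, Repository, commit and head are hypothetical, triples are
plain tuples of strings, the identifier is a SHA-1 hash over the parent, author and sorted triples, and the metadata is reduced to an author and a
parent (time metadata is left out so the identifiers stay reproducible).

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object), all plain strings here


@dataclass(frozen=True)
class Assertion:
    """Hypothetical commit-like record: an immutable graph plus minimal provenance."""
    graph: FrozenSet[Triple]
    parent: Optional[str]  # identifier of the previous assertion on the branch, if any
    author: str

    @property
    def id(self) -> str:
        # Content-derived identifier, loosely modelled on how Git names commits.
        lines = sorted(" ".join(t) for t in self.graph)
        payload = "\n".join([self.parent or "", self.author, *lines])
        return hashlib.sha1(payload.encode("utf-8")).hexdigest()


class Repository:
    """Hypothetical store giving access to assertions and branch heads."""

    def __init__(self) -> None:
        self.assertions: Dict[str, Assertion] = {}
        self.branches: Dict[str, str] = {}  # branch name -> id of its latest assertion

    def commit(self, branch: str, triples: Iterable[Triple], author: str) -> str:
        assertion = Assertion(frozenset(triples), self.branches.get(branch), author)
        self.assertions[assertion.id] = assertion
        self.branches[branch] = assertion.id  # the branch is the only mutable part
        return assertion.id

    def head(self, branch: str) -> FrozenSet[Triple]:
        # HEAD-like lookup: the current state of a mutable container (g-box).
        return self.assertions[self.branches[branch]].graph


repo = Repository()
first = repo.commit("main", {("ex:a", "ex:knows", "ex:b")}, "michael")
second = repo.commit(
    "main",
    {("ex:a", "ex:knows", "ex:b"), ("ex:b", "ex:knows", "ex:c")},
    "michael",
)
```

Referring to `first` always yields the same single triple, while `repo.head("main")` follows the branch and now returns the two triples recorded by
`second`. The graphs themselves never receive a name; only the assertions and the branch do.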

Naming graphs has been a hot topic, in particular whether a URI names a graph or simply labels it, as is the case with SPARQL Datasets. An assertion
is just like any other resource and has a URI that uniquely identifies it. It is the responsibility of the repository to provide the mapping
('blessing') between the assertion and the triples themselves. Just as with Git, repositories can be cloned, and that association between URI and
triples can be moved between repositories.
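
The mapping and cloning described above might be sketched as follows. This is only an illustration under stated assumptions: the class Store, its
methods bless, resolve and clone, and the example.org URIs are all hypothetical, and triples are again plain tuples of strings.

```python
from typing import Dict, FrozenSet, Iterable, Tuple

Triple = Tuple[str, str, str]


class Store:
    """Hypothetical repository that maps assertion URIs to immutable triple sets."""

    def __init__(self) -> None:
        self._graphs: Dict[str, FrozenSet[Triple]] = {}

    def bless(self, uri: str, triples: Iterable[Triple]) -> None:
        # An assertion is immutable, so a URI may be bound only once.
        if uri in self._graphs:
            raise ValueError(f"{uri} is already bound to a set of triples")
        self._graphs[uri] = frozenset(triples)

    def resolve(self, uri: str) -> FrozenSet[Triple]:
        return self._graphs[uri]

    def clone(self) -> "Store":
        # Copy the URI-to-triples associations; the frozensets can be shared safely.
        copy = Store()
        copy._graphs = dict(self._graphs)
        return copy


origin = Store()
first_uri = "http://example.org/assertions/1"   # hypothetical URI
second_uri = "http://example.org/assertions/2"  # hypothetical URI
origin.bless(first_uri, {("ex:a", "ex:knows", "ex:b")})

mirror = origin.clone()
mirror.bless(second_uri, {("ex:b", "ex:knows", "ex:c")})
```

After the clone, `first_uri` resolves to the same triples in both stores, while `second_uri` is known only to `mirror` until it is copied back.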

I'm not suggesting any specific vocabulary, nor do I think the RDF WG should recommend one. I think that just as with RCS, repositories should be free
to define their own RC schemas. But by identifying the graph use cases as part of the RC domain, I think we can bring some clarity to the RDF WG
discussions and encourage the development of specific RC schemas.

I've tried to be brief, so I left out a lot of specifics and related discussions/tangents. I can expand on these ideas, but hopefully this is
enough to get the rough idea across. From what I've seen of the recent discussions, there seems to be some convergence starting here. I
look forward to your feedback.

-m

Received on Wednesday, 12 October 2011 16:11:46 UTC