- From: michael <michael@thinknasium.org>
- Date: Wed, 12 Oct 2011 07:21:29 -0700
- To: public-rdf-comments@w3.org
While the Provenance WG is working on a general description of provenance, I want to address specifically the provenance of RDF triples, since the lack of a proper vocabulary has been a bit of a hindrance to the RDF WG. I suggest that the provenance of RDF statements might be better understood in terms of revision control (RC). A revision control vocabulary helps address several of the open issues relating to graphs and provides a more familiar terminology for the g-* terms. I think RC encompasses a lot of the use cases for graphs and is a vocabulary that is commonly understood.

A comparison between existing terms:

Concept                g-*     NGPT         SPARQL       RC                Suggested
serialized triples     g-text  document     results      file,patch,diff   graph literal
graph (immutable set)  g-snap  graph        graph        version           RDF graph
graph(s) w/identifier          NamedGraphs  Dataset      commit            assertion
mutable graph          g-box                Graph Store  branch,tag,label  repository

ISSUE-32 asks "Can we identify both g-boxes and g-snaps?" Yes! Using an RC vocabulary, RDF graphs, aka g-snaps, are like versions of a file. A serialized version is a graph literal. A mutable container (g-box) is like a branch: it represents a set of changes to a graph over time. An assertion of a graph or set of graphs is like a commit. Assertions provide the additional layer of indirection needed to describe provenance information and graph composition. A repository is the engine that provides access to graphs (assertions) and their metadata.

The behavior desired seems very similar to that provided by Git. Git manages immutable sets over time and provides provenance and time metadata. Similarly, a repository (aka store) should provide access to assertions and metadata. Because assertions are local to the repository and immutable like a commit, it's easy to assign a unique identifier to them. Graphs no longer need to be directly addressed, which removes the need to give names to graphs that don't have any intrinsic name to begin with.
Instead, you refer to an assertion. This has the added benefit that you always know which set of triples you are working with, just as when specifying a commit. A local identifier similar to HEAD can also be used to provide access to the current state of a particular branch, providing the functionality of a mutable container.

Naming graphs has been a hot topic: does a URI name a graph, or does it simply label it, as is the case with SPARQL Datasets? An assertion is just like any other resource and has a URI that uniquely identifies it. It is the responsibility of the repository to provide the mapping ('blessing') between the assertion and the triples themselves. Just as with Git, repositories can be cloned, and that association between URI and triples can be moved between repositories.

I'm not suggesting any specific vocabulary, nor do I think the RDF WG should recommend one. I think that just as with RCS, repositories should be free to define their own RC schemas. But by identifying the graph use cases as part of the RC domain, I think this will help bring some clarity to the RDF WG discussions and lead to the development of specific RC schemas.

I've tried to be brief, so I left a lot of specifics and related discussions/tangents out. I can expand on these ideas more, but hopefully this is enough to get the rough idea across. From what I've seen of the recent discussions, it looks like there is starting to be some convergence here. I look forward to your feedback.

-m
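
To make the Git analogy above a little more concrete, here is a minimal Python sketch of a repository that maps assertion URIs to immutable sets of triples and keeps a HEAD-like pointer per branch. It is an illustration only, not a proposed vocabulary or API: the names `Repository`, `Assertion`, `assert_graph`, `triples`, `metadata`, and `head`, the `http://example.org/assertion/` URI base, and the use of a SHA-1 digest as the identifier are all assumptions made for the example. Unlike Git, it records no timestamp, so identifiers stay deterministic.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object), kept as plain strings


@dataclass(frozen=True)
class Assertion:
    """Immutable, commit-like record: one graph version plus provenance."""
    uri: str
    graph: FrozenSet[Triple]
    author: str
    parent: Optional[str]  # URI of the previous assertion on the branch, if any


class Repository:
    """Hypothetical store: owns the mapping from assertion URI to triples."""

    def __init__(self, base: str = "http://example.org/assertion/") -> None:
        self._base = base
        self._assertions: Dict[str, Assertion] = {}
        self._branches: Dict[str, str] = {}  # branch name -> URI of its head

    def assert_graph(self, branch: str, triples: Iterable[Triple], author: str) -> str:
        """Record a new graph version on a branch and return its assertion URI."""
        graph = frozenset(triples)
        parent = self._branches.get(branch)
        # Identifier derived from content and provenance (an assumption of this sketch).
        payload = "\n".join(sorted(" ".join(t) for t in graph))
        digest = hashlib.sha1(f"{parent}|{author}|{payload}".encode("utf-8")).hexdigest()
        uri = self._base + digest
        self._assertions[uri] = Assertion(uri, graph, author, parent)
        self._branches[branch] = uri  # the branch is the only mutable part
        return uri

    def triples(self, uri: str) -> FrozenSet[Triple]:
        """Return the fixed set of triples behind an assertion."""
        return self._assertions[uri].graph

    def metadata(self, uri: str) -> Assertion:
        """Return the provenance record for an assertion."""
        return self._assertions[uri]

    def head(self, branch: str) -> str:
        """Return the current assertion of a branch (the mutable-container view)."""
        return self._branches[branch]


# Usage: two successive assertions on one branch.
repo = Repository()
first = repo.assert_graph("main", [("ex:alice", "foaf:knows", "ex:bob")], author="alice")
second = repo.assert_graph(
    "main",
    [("ex:alice", "foaf:knows", "ex:bob"), ("ex:bob", "foaf:knows", "ex:carol")],
    author="alice",
)
```

Referring to `first` always yields the same triples, while `repo.head("main")` follows the branch as it changes. That is the g-snap versus g-box distinction expressed through assertions rather than through graph names.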
Received on Wednesday, 12 October 2011 16:11:46 UTC