Support for Multiple Graphs and Graph Stores from michael on 2012-02-03 (public-rdf-comments@w3.org from February 2012)

From: michael <michael@thinknasium.org>
Date: Fri, 03 Feb 2012 12:25:41 -0800
To: public-rdf-comments@w3.org
Message-ID: <4F2C42C5.9050809@thinknasium.org>

One of the required features set forth in the charter is to "standardize a model and semantics for multiple graphs and graphs stores. ... The term
“Support for Multiple Graphs and Graph Stores” is used as a neutral term in this charter; this term is not and should not be considered as definitive.
The Working Group will have to define the right term(s)."[1]

In RDF, interpretations are performed over a single graph. The semantics provide guidance about how to merge/union graphs, but no attention is paid to
describing graphs separately because it is unnecessary. Additionally, as Pat recently said, "RDF has no notion of state or time or change in it
anywhere. ... If we are going to put that idea in, the change to RDF will be far more profound and far-reaching than anything we have considered so
far. The resulting language will not resemble current RDF at all at the semantic level."[2]

In practice, there is a need to describe and manage multiple graphs over time and to track their provenance. This requires additional terms and
semantics[3] and is the motivation behind the required feature. I suggest that this is the domain of version control. Version control is a well-known
solution for handling multiple immutable graphs and their metadata over time and multiple stores. Introducing a new vocabulary separate from the RDF
core semantics to describe RDF metadata will allow stores to better manage and exchange RDF data while keeping the semantics of RDF unchanged and
unaffected.

Version control addresses all of the categories of graph use cases: storage, publishing, querying and provenance. Version control also addresses many
of the open issues regarding graphs:

ISSUE-14: What is a named graph and what should we call it?
ISSUE-15: What is the relationship between the IRI and the triples in a dataset/quad-syntax/etc (SPARQL naming/labeling issue)
ISSUE-17: How are RDF datasets to be merged?
ISSUE-21: Can Node-IDs be shared between parts of a quad/multigraph format?
ISSUE-28: Do we need syntactic nesting of graphs (g-texts) as in N3?
ISSUE-29: Do we support SPARQL's notion of "default graph"?
ISSUE-32: Can we identify both g-boxes and g-snaps?
ISSUE-33: Do we provide a way to refer to sub-graphs?
ISSUE-35: Should there be an rdf:Graph construct, or something like that?
ISSUE-38: What new graph vocabulary should be added to describe graphs?
ISSUE rdfms-assertion: RDF is not just a data model; an RDF statement is an assertion (postponed from previous WG[4])

All of these issues are orthogonal to RDF interpretations, which are performed over a single, nameless graph. While there is a need to address these
issues, in the interest of keeping the core RDF semantics as simple as possible, I suggest that the issues relating to graph management may be outside
the scope of this WG. This has already been suggested by Pat[2], David[3] and Richard[5].

To help illustrate that these issues can be addressed by version control, I'll present my suggestion for a new graph management vocabulary for RDF data.

Some revision control terms have already been mentioned by the WG: branches[6][7], trees[7], patches[8][9] and assertions[4][10]. Here is a comparison
of Sandro's g-* terminology[11] with Git, a popular DVCS, and some terms I suggest for RDF graph management (RDFGM):

g-*      Git         RDFGM          Description
----------------------------------------------------------------------------
g-text   patch       graph literal  serialized set of RDF statements, triples
g-snap   blob        Graph          set of RDF statements
         tree        Dataset        description of one or more graphs/datasets
         commit      Assertion      provenance for a dataset
g-box    branch      Branch         dataset of assertions / label for assertions
         repository  Repository     set of graphs and their metadata
         git         Store          an engine that provides access to repositories
A g-text is the serialized content of an RDF graph[11], aka triples. This is similar to a patch in a revision control system. I prefer the term graph
literal, which is a more accurate description.

A g-snap is an immutable set of RDF statements[11]. This is defined by RDF semantics to be a graph[12] and is the result of parsing a graph literal.
Graphs may only contain RDF statements; they cannot be nested. A graph is similar to a blob, which represents the content of a file. Both graphs and blobs
are entirely defined by their data, thus allowing identical graphs to be represented by a single graph. Because graphs are immutable, simultaneous
access to graphs is no different than asynchronous access (use case 2 here[13]). Immutable graphs also simplify signing/endorsement (as discussed in
this thread[14]).
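
To make the "entirely defined by their data" point concrete, here is a minimal sketch of a content-derived graph identifier, in the spirit of how Git names blobs. It is only an illustration under stated assumptions: triples are tuples of already-serialized terms, blank nodes are not handled (they would need canonical labelling first), and the function name graph_id and the choice of SHA-1 are mine, not part of any RDF specification.

```python
import hashlib

def graph_id(triples):
    """Return a content-derived identifier for a set of ground triples.

    Assumption: each triple is a (subject, predicate, object) tuple of
    strings already written in an N-Triples-like form, with no blank nodes.
    """
    graph = frozenset(triples)                      # a graph is a set: duplicates collapse
    lines = sorted(" ".join(t) + " ." for t in graph)  # order must not affect the id
    return hashlib.sha1("\n".join(lines).encode("utf-8")).hexdigest()

# Two "graph literals" with different ordering and a repeated statement
a = [("<http://example.org/s>", "<http://example.org/p>", '"o"'),
     ("<http://example.org/s>", "<http://example.org/q>", '"x"')]
b = list(reversed(a)) + [a[0]]

same = graph_id(a) == graph_id(b)   # both literals parse to the same graph
```

Because the identifier depends only on the statements, a store could keep a single copy of identical graphs no matter how often they are submitted.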

Because a graph should be treated like any other RDF resource [15], the relationship of URIs to graphs is one of naming. This means that a URI that
identifies a graph cannot identify any other resource. The association of the URI to the information resource (graph) is made by the store for a
particular repository when reading in statements. The serialization of this association between URI and a set of statements is described at the
syntactic level as presently done in TriG.

Datasets describe the relation between two or more graphs or datasets. This is similar to trees in Git which relate blobs and other trees to each
other. Datasets do not directly contain statements but are described in terms of other graphs/datasets and must be queried to produce a single graph
that can then be interpreted under RDF semantics. As with graphs, datasets are immutable. Datasets should specify the desired merge/union/disjoint
behavior of the referenced graphs/datasets. Datasets enable the partitioning of statements in separate graphs (Richard's second use case here[16]).
Allowing nested datasets enables reuse of datasets; however, a dataset cannot contain itself, nor can it be a member of any of its included datasets.
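
As a rough sketch of how a nested dataset might be flattened into a single graph, the following resolves a dataset by taking the set union of its members and rejects a dataset that contains itself. The names resolve, graphs and datasets are hypothetical, and only union is shown; a true RDF merge would additionally have to rename blank nodes apart.

```python
def resolve(name, graphs, datasets, _seen=()):
    """Flatten the graph or dataset called `name` into one set of triples.

    Assumptions: `graphs` maps names to sets of triples, `datasets` maps
    names to lists of member names, and members are combined by set union.
    """
    if name in graphs:
        return frozenset(graphs[name])
    if name in _seen:
        raise ValueError(f"dataset {name!r} contains itself")
    result = frozenset()
    for member in datasets[name]:
        # Remember the path taken so that cycles are detected
        result |= resolve(member, graphs, datasets, _seen + (name,))
    return result

graphs = {
    "g1": {("ex:a", "ex:p", "ex:b")},
    "g2": {("ex:b", "ex:p", "ex:c")},
}
datasets = {
    "d1": ["g1", "g2"],
    "d2": ["d1", "g1"],   # nested dataset reusing d1
    "bad": ["bad"],       # not allowed: a dataset containing itself
}

flat = resolve("d2", graphs, datasets)
```

The resulting flat set is what would then be interpreted under the ordinary RDF semantics.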

In revision control, a commit "is the action of writing or merging the changes made in the working copy back to the repository. The terms 'commit' and
'checkin' can also be used in noun form to describe the new revision that is created as a result of committing. ... A revision is the state at a point
in time of the entire tree in the repository." [17] Similarly, an assertion is a resource that describes a graph or dataset including time, provenance
and other metadata such as signatures/endorsements. Assertions are linked together into a history via zero or more parent assertions. Notably,
assertions were left as a postponed item from the previous WG[4].
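
The parent links can be pictured with a small sketch. The Assertion record and its field names below are hypothetical, chosen only to mirror the description above (a target graph or dataset, some provenance metadata, and zero or more parents); they are not a proposed vocabulary.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Assertion:
    target: str                            # name of the graph or dataset described
    author: str                            # provenance metadata (illustrative)
    timestamp: str                         # e.g. an ISO 8601 string
    parents: Tuple["Assertion", ...] = ()  # zero or more parent assertions

def history(assertion):
    """Yield the assertion and all of its ancestors, each exactly once."""
    seen = set()
    stack = [assertion]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        yield current
        stack.extend(reversed(current.parents))

first = Assertion("g1", "alice", "2012-02-01T10:00:00Z")
second = Assertion("g2", "bob", "2012-02-02T10:00:00Z", (first,))
third = Assertion("g3", "alice", "2012-02-03T10:00:00Z", (second,))

targets = [a.target for a in history(third)]
```

An assertion with two parents would correspond to a merge, just as a merge commit does in a DVCS.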

A g-box has been described as a mutable container[11]. This is similar to a branch in revision control. A branch provides a means of identifying a
specific set of changes over time. A branch in RDFGM is simply a dataset consisting of a set of assertions. This allows for the provenance of each
change to the underlying graph(s) to be preserved. It can be queried to produce a graph from one of its associated assertions. A g-box generally
refers to the branch head, the latest assertion in the branch. An alternative way to implement branches would be to label assertions with a URI. This
would allow for the usage of URIs to produce RDF graphs as is done in SPARQL without the URI actually 'naming' the graph.
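
One way to read the branch idea is as a mutable label over an append-only list of immutable snapshots. The sketch below is only that, an assumption-laden illustration: the Branch class, its commit and graph methods, and the dictionary layout of each record are all invented for this example.

```python
class Branch:
    """A mutable label (URI) whose history is a list of immutable graph snapshots."""

    def __init__(self, uri):
        self.uri = uri
        self._assertions = []          # oldest first; the last entry is the head

    def head_index(self):
        return len(self._assertions) - 1 if self._assertions else None

    def commit(self, triples, author):
        """Record a new snapshot and return its position in the history."""
        record = {
            "graph": frozenset(triples),   # snapshots never change once stored
            "author": author,
            "parent": self.head_index(),
        }
        self._assertions.append(record)
        return len(self._assertions) - 1

    def graph(self, index=None):
        """Return the graph at `index`, or at the branch head by default."""
        if not self._assertions:
            raise LookupError("branch has no assertions")
        if index is None:
            index = -1
        return self._assertions[index]["graph"]

t1 = ("ex:a", "ex:p", "ex:b")
t2 = ("ex:b", "ex:p", "ex:c")

branch = Branch("http://example.org/branches/main")
v0 = branch.commit({t1}, "alice")
v1 = branch.commit({t1, t2}, "bob")

latest = branch.graph()       # the head: what a g-box would usually denote
earlier = branch.graph(v0)    # an older state remains retrievable
```

The label changes what it points to over time, while every graph it has ever pointed to stays fixed.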

A repository is a set of statements including metadata about those statements. This is comparable to the git object database. RDF repositories may be
cloned in a similar way.

A store is an engine that provides access to a repository. Stores parse graph literals and assign them to graphs in the repository. Stores also
provide access to statements via queries. The store is analogous to the git executable.

This is a basic outline for a revision control vocabulary. As with VCSs, stores may implement their own specific vocabularies to manage their
repositories allowing the flexibility and choice of repository design. More development and feedback from the WG/community is necessary, but I hope
this provides a new, useful perspective on the graph-related issues and may lead to some interim resolution.

-Michael

[1] http://www.w3.org/2011/01/rdf-wg-charter
[2] http://lists.w3.org/Archives/Public/public-rdf-wg/2012Jan/0092.html
[3] http://lists.w3.org/Archives/Public/public-rdf-wg/2012Jan/0097.html
[4] http://www.w3.org/2000/03/rdf-tracking/#rdfms-assertion
[5] http://lists.w3.org/Archives/Public/public-rdf-wg/2011Aug/0124.html
[6] http://lists.w3.org/Archives/Public/public-rdf-wg/2011Apr/0432.html
[7] http://lists.w3.org/Archives/Public/public-rdf-wg/2011Dec/0062.html
[8] http://lists.w3.org/Archives/Public/public-rdf-wg/2011Feb/0130.html
[9] http://lists.w3.org/Archives/Public/public-rdf-wg/2011Dec/0049.html
[10] http://lists.w3.org/Archives/Public/public-rdf-wg/2011Mar/0750.html
[11] http://lists.w3.org/Archives/Public/public-rdf-wg/2011Feb/0092.html
[12] http://www.w3.org/TR/rdf-mt/#graphdefs
[13] http://lists.w3.org/Archives/Public/public-rdf-wg/2012Jan/0021.html
[14] http://lists.w3.org/Archives/Public/public-rdf-wg/2012Jan/0046.html
[15] http://lists.w3.org/Archives/Public/public-rdf-wg/2011Dec/0177.html
[16] http://lists.w3.org/Archives/Public/public-rdf-wg/2012Jan/0051.html
[17] http://en.wikipedia.org/wiki/Revision_control#Common_vocabulary

Received on Friday, 3 February 2012 20:26:39 UTC