Definition of "RDF dataset"

rq23 defines the term "RDF dataset" as the thing that a query is executed
against.  I chose the term because it isn't "knowledge base" or another term
that has been widely used.  There is never a perfect choice of word/phrase but
I hope this terminology gives us the chance to define what SPARQL does with
less baggage.

rq23 provides both a formal description of SPARQL/query and an informal 
description through examples.  They should agree although the examples can never 
be complete.


Please be warned that the editors' draft (v1.181) is not fully up to date with 
the F2F but I have started by restructuring and putting the definition of "RDF 
Dataset" in a section by itself.

   http://www.w3.org/2001/sw/DataAccess/rq23/#rdfDataset

------------------------------------------------
An RDF dataset is defined as a set consisting of one background graph and zero
or more named graphs.

RDF dataset = { G , zero or more (<Ui>, Gi) }

   G - background graph
   Gi - an RDF graph
   <Ui> - URI reference
------------------------------------------------


That is, a set containing one graph plus zero or more pairs of URIref and 
graph.  There may be no named graphs.  The background graph may be empty.
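
To make the shape of the definition concrete, here is a minimal sketch in
Python.  It is not from rq23: the class name, the representation of terms as
plain strings, and the example data are all my assumptions.  It also assumes
each name maps to exactly one graph, which the definition above does not state.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Tuple

# Assumption: terms are plain strings and a graph is a frozenset of triples.
Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]


@dataclass
class RDFDataset:
    """RDF dataset = { G , zero or more (<Ui>, Gi) }"""
    background: Graph = frozenset()                        # G, may be empty
    named: Dict[str, Graph] = field(default_factory=dict)  # <Ui> -> Gi


# Hypothetical example data, not taken from rq23.
A: Graph = frozenset({("ex:alice", "ex:knows", "ex:bob")})

# A dataset with a background graph and no named graphs.
no_named_graphs = RDFDataset(background=A)

# A dataset with an empty background graph and one named graph.
empty_background = RDFDataset(named={"http://example/u1": A})
```

Both values are legal under the definition: nothing requires a named graph to
exist, and nothing requires the background graph to contain triples.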

"background graph" was the agreed term from the F2F.  It was called the unnamed 
graph before.

In querying an RDF dataset, the pattern "(s p o)" accesses G, the background 
graph, the form "GRAPH ?g (s p o)" accesses the (<Ui>, Gi)'s, and the form 
"GRAPH <U> (s p o)" applies the pattern to just the graph with name <U>.
("GRAPH" was "SOURCE").
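
The three forms can be sketched roughly as follows.  This is only an
illustration under my own assumptions: patterns use None as a wildcard rather
than real SPARQL variables, the functions return matching triples rather than
bindings, the function names are made up, and an unknown name in the
"GRAPH <U>" form is assumed to give no matches.

```python
from typing import Dict, FrozenSet, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]
# Assumption: None in a position means "matches anything".
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]


def match(graph: Graph, pattern: Pattern) -> Set[Triple]:
    """Triples in graph that agree with every non-None position of pattern."""
    return {t for t in graph
            if all(want is None or want == got for want, got in zip(pattern, t))}


def match_background(background: Graph, pattern: Pattern) -> Set[Triple]:
    """(s p o): look only at G, the background graph."""
    return match(background, pattern)


def match_any_named(named: Dict[str, Graph],
                    pattern: Pattern) -> Set[Tuple[str, Triple]]:
    """GRAPH ?g (s p o): try every (<Ui>, Gi), reporting which name matched."""
    return {(uri, t) for uri, g in named.items() for t in match(g, pattern)}


def match_named(named: Dict[str, Graph], uri: str,
                pattern: Pattern) -> Set[Triple]:
    """GRAPH <U> (s p o): look only at the graph named <U>."""
    # Assumption: a name that is not in the dataset yields no matches.
    return match(named.get(uri, frozenset()), pattern)


# Hypothetical data, not taken from rq23.
G = frozenset({("ex:doc", "ex:source", "ex:site")})
named = {"u1": frozenset({("ex:a", "ex:p", "ex:b")}),
         "u2": frozenset({("ex:a", "ex:p", "ex:c")})}
anything = (None, None, None)
```

With this data, match_background(G, anything) only ever sees the triple in G,
while match_any_named(named, anything) sees the triples in u1 and u2 and never
the one in G.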

This is described in:

   http://www.w3.org/2001/sw/DataAccess/rq23/#queryDataset


There is a section after that to define how the query itself can describe the 
dataset.  That section is what I think we are mainly discussing.  It uses the 
keywords from the F2F but I regard it as unfinished - I will not cover this in 
this message in order to concentrate on the idea of "RDF dataset".

In the idea of "RDF dataset", there is no assumption about the setup of the dataset.

   G may include the union of the Gi.
   G may be disjoint from any Gi or subset of Gi.


Both are allowed - the definition of RDF dataset does not imply one way over 
another; it is more general, and we may choose to restrict it.  The concept 
"RDF dataset" does not say how the dataset is built. Therefore, it should be 
able to express all the various different use cases we have.
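
As a rough illustration that both setups fit the same definition, the sketch
below reports, for each named graph, whether its triples are also in the
background graph.  The function name and data are hypothetical, and the check
is a plain subset test on triples, so it is only meaningful for graphs without
bNodes.

```python
from typing import Dict, FrozenSet, Tuple

Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]


def named_graphs_in_background(background: Graph,
                               named: Dict[str, Graph]) -> Dict[str, bool]:
    """For each <Ui>, is every triple of Gi also in G?  (Ignores bNodes.)"""
    return {uri: graph <= background for uri, graph in named.items()}


# Hypothetical graphs, not taken from rq23.
A = frozenset({("ex:a", "ex:p", "ex:b")})
B = frozenset({("ex:c", "ex:p", "ex:d")})
named = {"u1": A, "u2": B}

# G is the union of the named graphs.
union_case = named_graphs_in_background(A | B, named)

# G is empty, so it is disjoint from every named graph.
disjoint_case = named_graphs_in_background(frozenset(), named)

# G contains A but not B.
mixed_case = named_graphs_in_background(A, named)
```

All three are RDF datasets in the sense above; the definition does not prefer
any of them.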

I have avoided use of the terms "trusted" and "untrusted" to concentrate on 
naming.  As SteveH points out, an approach to make these orthogonal is to have a 
flag associated with graphs.

Examples of RDF datasets:

Suppose we have graphs A and B.  The following are some of the different
datasets that can be made up from A and B.

-- Dataset example 1:

A single background graph is the RDF merge of A and B.  (I use "RDF merge" to be 
clear that the background graph is accessed as an RDF graph.  It also makes it 
clear what happens about bNodes.)

The RDF dataset is  { merge[A,B] } with no pairs of name/graph.

-- Dataset example 2:

A background graph that is the merge of A and B, together with access to the 
graphs by the names <u1> and <u2>.  An implementation could avoid copies but 
exactly how will depend on the implementation.

The RDF dataset is  { merge[A,B], (<u1>, A), (<u2>, B) }

This is the example of (s p o) accessing all triples, and further having access 
to the individual named graphs.
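
A small sketch of why "RDF merge" rather than plain set union matters for
bNodes.  This is my own illustration, not an implementation from rq23: terms
are plain strings, anything starting with "_:" is assumed to be a bNode label,
and the renaming scheme (appending a per-graph suffix) is just one way to keep
bNodes from different graphs apart.

```python
from typing import FrozenSet, Tuple

Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]


def rdf_merge(*graphs: Graph) -> Graph:
    """Union of the graphs after renaming bNodes apart, one suffix per graph."""
    merged = set()
    for index, graph in enumerate(graphs):
        def rename(term: str, index: int = index) -> str:
            # Assumption: a term starting with "_:" is a bNode label.
            return f"{term}_g{index}" if term.startswith("_:") else term
        merged.update((rename(s), rename(p), rename(o)) for s, p, o in graph)
    return frozenset(merged)


# Hypothetical graphs that happen to reuse the same bNode label.
A = frozenset({("_:x", "ex:name", "Alice")})
B = frozenset({("_:x", "ex:name", "Bob")})

# { merge[A,B], (<u1>, A), (<u2>, B) }
background = rdf_merge(A, B)
named = {"u1": A, "u2": B}
```

A plain union of A and B would leave one bNode with two names; the merge keeps
two distinct bNodes, one from each graph, while the named graphs A and B are
untouched.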

-- Dataset example 3:

The background graph has some provenance information, graph P, about A and B.
This is the example in rq23 "GRAPH and a single, unnamed graph":

   http://www.w3.org/2001/sw/DataAccess/rq23/#sourcePlainGraph

where A and B are the same URL read at different times, obtaining different graphs.

The RDF dataset is  { P, (<u1>, A), (<u2>, B) }

and <u1>, <u2> are internal names: P records the mapping of internal names to 
original location (the same location in the example).

-- Dataset example 4:

The background graph is empty. A plain "(s p o)" does not match - the background 
graph has no triples.

The RDF dataset is  { empty, (<u1>, A), (<u2>, B) }.

or omit the background graph  { (<u1>, A), (<u2>, B) } -- "(s p o)" does not 
match in either case.

-- Dataset example 5:

The background graph contains A, but not B.  Both A and B are available as named 
graphs.  The RDF dataset is { A, (<u1>, A), (<u2>, B) }.

-- Dataset example 6:

The graph A is known by <u1> and also by <u3> and is in the background graph. B 
is not in the dataset.

The RDF dataset is { A, (<u1>, A), (<u3>, A) }.

-- Dataset example 7:

If we have a third graph C, then we can have the RDF merge of A and C, give it a 
name <u4> and still retain access to just A:

{ empty, (<u4>, merge[A,C]), (<u1>, A), (<u2>, B) }


End of examples.


There is also the matter of how, if at all, the query itself can describe a 
dataset.  I think this is the main area of difference and there is subtly 
different terminology so I have avoided that matter here and await people's 
responses to the idea of "RDF dataset".

If we have a common definition of "RDF dataset", then we can write the various 
options out for query keywords in terms of the effect on the RDF dataset for the 
query.

Does this definition of RDF dataset, and the relation to graph patterns, form a 
basis for defining how a query might define a dataset within the SPARQL syntax?

Is this definition of RDF dataset missing anything?

 Andy

Received on Friday, 28 January 2005 19:49:13 UTC