- From: Seaborne, Andy <andy.seaborne@hp.com>
- Date: Fri, 28 Jan 2005 19:48:58 +0000
- To: RDF Data Access Working Group <public-rdf-dawg@w3.org>
rq23 defines the term "RDF dataset" as being the thing that a query is against.
I chose the term as it isn't "knowledge base" or other term that has been
widely used. There is never a perfect choice of word/phrase but I hope this
terminology gives us the chance to define what SPARQL does with less baggage.
rq23 provides both a formal description of SPARQL/query and an informal
description through examples. They should agree although the examples can never
be complete.
Please be warned that the editors' draft (v1.181) is not fully up to date with
the F2F but I have started by restructuring and put the definition of "RDF
Dataset" in a section by itself.
http://www.w3.org/2001/sw/DataAccess/rq23/#rdfDataset
------------------------------------------------
An RDF dataset is defined as a set consisting of a background graph and a number of named graphs.
RDF dataset = { G , zero or more (<Ui>, Gi) }
G - background graph
Gi - an RDF graph
<Ui> - URI reference
------------------------------------------------
that is, a set of a graph and a number of pairs of URIref and graph. There may
be no named graphs. The background graph may be empty.
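
As a rough illustration of the definition (a sketch only, not text from rq23), the structure could be modelled in Python as below. The names RDFDataset, background and named are invented for this example, a graph is simplified to a frozenset of string triples, and using a dict assumes that each name maps to exactly one graph.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Tuple

Triple = Tuple[str, str, str]   # (subject, predicate, object), simplified to strings
Graph = FrozenSet[Triple]       # an RDF graph modelled as a set of triples

EMPTY: Graph = frozenset()


@dataclass
class RDFDataset:
    """One background graph G plus zero or more (<Ui>, Gi) pairs."""
    background: Graph = EMPTY                               # G, may be empty
    named: Dict[str, Graph] = field(default_factory=dict)   # <Ui> -> Gi, may be empty


# A dataset with an empty background graph and no named graphs is still valid.
ds = RDFDataset()
```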
"background graph" was the agreed term from the F2F. It was called the unnamed
graph before.
In querying an RDF dataset, the pattern "(s p o)" accesses G, the background
graph, the form "GRAPH ?g (s p o)" accesses the (<Ui>, Gi)'s, and the form
"GRAPH <U> (s p o)" applies the pattern to just the graph with name <U>.
("GRAPH" was "SOURCE").
This is described in:
http://www.w3.org/2001/sw/DataAccess/rq23/#queryDataset
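
The three access forms can be sketched in Python as follows. This is an informal illustration under simplifying assumptions, not the rq23 semantics: a pattern is a triple in which None plays the role of a variable, graphs are frozensets of string triples, and the function names are invented for the example.

```python
from typing import Dict, FrozenSet, Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]  # None acts as a variable


def match(graph: Graph, pattern: Pattern) -> Iterator[Triple]:
    # Yield the triples that agree with every non-None term of the pattern.
    for triple in graph:
        if all(p is None or p == t for p, t in zip(pattern, triple)):
            yield triple


def query_background(background: Graph, pattern: Pattern) -> Set[Triple]:
    # Plain (s p o): only the background graph G is consulted.
    return set(match(background, pattern))


def query_any_graph(named: Dict[str, Graph], pattern: Pattern) -> Set[Tuple[str, Triple]]:
    # GRAPH ?g (s p o): try every (<Ui>, Gi) pair, pairing each match with its name.
    return {(name, t) for name, g in named.items() for t in match(g, pattern)}


def query_graph(named: Dict[str, Graph], name: str, pattern: Pattern) -> Set[Triple]:
    # GRAPH <U> (s p o): only the graph named <U>; an unknown name matches nothing.
    return set(match(named.get(name, frozenset()), pattern))


G = frozenset({("a", "p", "1")})
named = {"u1": frozenset({("b", "p", "2")}), "u2": frozenset({("c", "q", "3")})}

in_background = query_background(G, (None, "p", None))
in_any_named = query_any_graph(named, (None, "p", None))
in_u2 = query_graph(named, "u2", (None, None, None))
```

Note that the plain pattern never sees the named graphs here; whether G happens to contain their triples is a property of how the dataset was built, not of the query form.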
There is a section after that to define how the query itself can describe the
dataset. That section is what I think we are mainly discussing. It uses the
keywords from the F2F but I regard it as unfinished - I will not cover this in
this message in order to concentrate on the idea of "RDF dataset".
In the idea of "RDF dataset", there is no assumption about the setup of the dataset.
G may include the union of the Gi.
G may be disjoint from any Gi or subset of Gi.
Both are allowed - the definition of RDF dataset does not imply one way over
another; it is more general, and we may choose to restrict it. The concept
"RDF dataset" does not say how the dataset is built. Therefore, it should be
able to express all the various different use cases we have.
I have avoided use of the terms "trusted" and "untrusted" to concentrate on
naming. As SteveH points out, an approach to make these orthogonal is to have a
flag associated with graphs.
Examples of RDF datasets:
Suppose we have graphs A and B.
Different datasets can be made up from A and B; these are some of the possibilities:
-- Dataset example 1:
A single background graph is the RDF merge of A and B. (I use "RDF merge" to be
clear that the background graph is accessed as an RDF graph. It also makes it
clear what happens about bNodes.)
The RDF dataset is { merge[A,B] } with no pairs of name/graph.
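
To show why "RDF merge" rather than plain union matters for bNodes, here is a minimal Python sketch. It assumes blank nodes are written as strings starting with "_:" and that graphs are frozensets of string triples; the function names and the renaming scheme are made up for illustration and are not how any particular implementation does it.

```python
from typing import FrozenSet, Tuple

Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]


def is_bnode(term: str) -> bool:
    # Assumption for this sketch: blank node labels start with "_:".
    return term.startswith("_:")


def rdf_merge(*graphs: Graph) -> Graph:
    # Union of the graphs, with blank nodes renamed apart per input graph
    # so that equal labels from different graphs stay distinct nodes.
    merged = set()
    for i, graph in enumerate(graphs):
        for triple in graph:
            merged.add(tuple(f"_:g{i}_{t[2:]}" if is_bnode(t) else t for t in triple))
    return frozenset(merged)


A = frozenset({("_:x", "name", "Alice")})
B = frozenset({("_:x", "name", "Bob")})

M = rdf_merge(A, B)   # two distinct blank nodes, one per source graph
U = A | B             # plain union: one blank node with two names
```

With the plain union, the shared label "_:x" would wrongly suggest a single resource named both Alice and Bob; the merge keeps the two nodes apart.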
-- Dataset example 2:
A background graph that is the merge of A and B, together with access to the
graphs by the names <u1> and <u2>. An implementation could avoid copies but
exactly how will depend on the implementation.
The RDF dataset is { merge[A,B], (<u1>, A), (<u2>, B) }
This is the example of (s p o) accessing all triples, and further having access
to the individual named graphs.
-- Dataset example 3:
The background graph has some provenance information, graph P, about A and B.
This is the example in rq23 "GRAPH and a single, unnamed graph":
http://www.w3.org/2001/sw/DataAccess/rq23/#sourcePlainGraph
where A and B are the same URL read at different times, obtaining different graphs.
The RDF dataset is { P, (<u1>, A), (<u2>, B) }
and <u1>, <u2> are internal names : P records the mapping of internal names to
original location (the same location in the example).
-- Dataset example 4:
The background graph is empty. A plain "(s p o)" does not match - the background
graph has no triples.
The RDF dataset is { empty, (<u1>, A), (<u2>, B) }
or, omitting the background graph, { (<u1>, A), (<u2>, B) } -- "(s p o)" does not
match in either case.
-- Dataset example 5:
The background graph contains A, but not B. Both A and B are available as named
graphs. The RDF dataset is { A, (<u1>, A), (<u2>, B) }.
-- Dataset example 6:
The graph A is known by <u1> and also by <u3> and is in the background graph. B
is not in the dataset.
The RDF dataset is { A, (<u1>, A), (<u3>, A) }.
-- Dataset example 7:
If we have a third graph C, then we can have the RDF merge of A and C, give it a
name <u4> and still retain access to just A:
{ empty, (<u4>, merge[A, C]), (<u1>, A) , (<u2>, B) }
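
Some of the examples can be written down directly as data. The Python sketch below is only an illustration: each dataset is a (background graph, {name: graph}) pair, the triples in A, B and C are invented, and plain set union stands in for RDF merge on the assumption that the graphs contain no bNodes. Example 3 is left out because it needs the provenance graph P.

```python
# Invented sample graphs; real graphs would hold RDF triples.
A = frozenset({("a", "p", "1")})
B = frozenset({("b", "p", "2")})
C = frozenset({("c", "p", "3")})
EMPTY = frozenset()

# Each dataset is (background graph, {name: graph}).
example1 = (A | B, {})
example2 = (A | B, {"u1": A, "u2": B})
example4 = (EMPTY, {"u1": A, "u2": B})
example5 = (A, {"u1": A, "u2": B})
example6 = (A, {"u1": A, "u3": A})
example7 = (EMPTY, {"u4": A | C, "u1": A, "u2": B})


def background_triples(dataset):
    """Triples that a plain (s p o) pattern of three variables could match."""
    background, _named = dataset
    return set(background)


# Example 4: nothing matches a plain pattern, yet both graphs remain reachable by name.
nothing = background_triples(example4)
everything = background_triples(example2)
```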
End of examples.
There is also the matter of how, if at all, the query itself can describe a
dataset. I think this is the main area of difference and there is subtly
different terminology so I have avoided that matter here and await people's
responses to the idea of "RDF dataset".
If we have a common definition of "RDF dataset", then we can write the various
options out for query keywords in terms of the effect on the RDF dataset for the
query.
Does this definition of RDF dataset, and the relation to graph patterns, form a
basis for defining how a query might define a dataset within the SPARQL syntax?
Is this definition of RDF dataset missing anything?
Andy
Received on Friday, 28 January 2005 19:49:13 UTC