Re: Definition of "RDF dataset" from Seaborne, Andy on 2005-02-01 (public-rdf-dawg@w3.org from January to March 2005)

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Tue, 01 Feb 2005 21:17:24 +0000
To: Alberto Reggiori <alberto@asemantics.com>
Cc: RDF Data Access Working Group <public-rdf-dawg@w3.org>
Message-ID: <41FFF1E4.4080104@hp.com>
Alberto Reggiori wrote:
> 
> 
> Andy,
> 
> We find this proposal appealing and it seems to cover all our use-cases 
> for source/provenance/context we have been dealing with (and not 
> considering the named-graph possibly identified by a bNode).
> 
> The distinction between the background graph and named-graphs is a good 
> one.
> 
> We would rather leave the trust/untrust terminology to some higher-level 
> RDF vocabulary to be wrapped around the model you are proposing.

Agreed - the area of "trust" is much wider than query and the requirements 
it would place on query aren't clear to me today.  Therefore, viewing it as 
a higher level seems appropriate at the moment.  I now see that there are 
some common use cases but it is too early to propose an approach that 
restricts, or makes inconvenient, other approaches while there is still so 
much experimentation to be done.

> 
> About the syntax proposed in the current spec (directly derived from the 
> f2f discussion): it is good to keep the distinction between querying the 
> RDF-dataset and describing it in the query itself. That said, we find 
> the term LOAD a bit misleading (e.g. it would map for an implementation 
> to a DBI/JDBC connect method) and we still prefer FROM, perhaps 
> qualified with some AS/NAMED keyword instead of distinguishing between LOAD 
> and FROM. We think that the FROM keyword as proposed is just a 
> "modifier" of the actual LOAD (which could trigger implementation 
> specific things to deal with named-graphs). So we would propose 
> s/LOAD/FROM/ s/FROM/FROM NAMED/ or s/FROM/FROM AS/

I hope splitting out the sections into dataset, GRAPH and directives means 
we can avoid getting into a big interconnected decision space.

DaveB suggested s/LOAD/WITH/ to remove an implication of a permanent change 
to the RDF store beyond the lifetime of the query.

> 
> To conclude, independently from the purely syntactic discussion, we 
> support the proposal

Great.

There are still the syntax issues around LOAD/FROM/WITH/... Section "9 
Specifying RDF Datasets" is awaiting a reworking in the light of discussions 
and a sense of whether sections 7 (dataset) and 8 (GRAPH) work for people. 
It may even be better to say nothing about these directives, rather than 
have a set that favours one trust approach over another.

I think an important factor will be the relationship to the protocol.  It's 
been observed before that this "protocol" (in the widest sense to include 
API).  The service description may help (it's too early to judge) -- or it 
may be just a tempting place to put tricky issues!

I'm glad you can see how to get your use cases built on the basic conceptual 
model of the RDF dataset.

	Andy

> 
> Yours
> 
> Alberto
> 
> On Jan 28, 2005, at 8:48 PM, Seaborne, Andy wrote:
> 
>>
>> rq23 defines the term "RDF dataset" as being the thing that a query is 
>> against.  I chose the term as it isn't "knowledge base" or other term 
>> that has been widely used.  There is never a perfect choice of 
>> word/phrase but I hope this terminology gives us the chance to define 
>> what SPARQL does with less baggage.
>>
>> rq23 provides both a formal description of SPARQL/query and an 
>> informal description through examples.  They should agree although the 
>> examples can never be complete.
>>
>>
>> Please be warned that the editors' draft (v1.181) is not fully up to 
>> date with the F2F but I have started by restructuring and put the 
>> definition of "RDF Dataset" in a section by itself.
>>
>>   http://www.w3.org/2001/sw/DataAccess/rq23/#rdfDataset
>>
>> ------------------------------------------------
>> An RDF dataset is defined as a set consisting of a background graph and a 
>> number of named graphs.
>>
>> RDF dataset = { G , zero or more (<Ui>, Gi) }
>>
>>   G - background graph
>>   Gi - an RDF graph
>>   <Ui> - URI reference
>> ------------------------------------------------
>>
>>
>> that is, a set of a graph and a number of pairs of URIref and graph.  
>> There may be no named graphs.  The background graph may be empty.
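
One way to picture the definition above is as a small data structure. This is only a sketch, not anything taken from rq23: the class name RDFDataset, the string-tuple encoding of triples and the example URI are illustrative assumptions. Using a dict for the named graphs also assumes each URI reference names at most one graph, which the definition as quoted does not state.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Tuple

# Assumption: a triple is a plain (subject, predicate, object) tuple of strings.
Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]


@dataclass
class RDFDataset:
    """Hypothetical model: one background graph plus zero or more named graphs."""
    background: Graph = frozenset()                        # G, may be empty
    named: Dict[str, Graph] = field(default_factory=dict)  # <Ui> -> Gi


# A dataset with an empty background graph and a single named graph.
a = frozenset({("ex:alice", "ex:knows", "ex:bob")})
ds = RDFDataset(named={"http://example.org/u1": a})
```

With no arguments, RDFDataset() gives the smallest case the definition allows: an empty background graph and no named graphs.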
>>
>> "background graph" was the agreed term from the F2F.  It was called 
>> the unnamed graph before.
>>
>> In querying an RDF dataset, the pattern "(s p o)" accesses G, the 
>> background graph, the form "GRAPH ?g (s p o)" accesses the (<Ui>, 
>> Gi)'s, and the form "GRAPH <U> (s p o)" applies the pattern to just 
>> the graph with name <U>.
>> ("GRAPH" was "SOURCE").
>>
>> This is described in:
>>
>>   http://www.w3.org/2001/sw/DataAccess/rq23/#queryDataset
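
The three access forms described above could be sketched roughly as follows. The function names, the use of None as a wildcard, and the choice to return no matches for an unknown graph name are all assumptions made for illustration; none of this is the evaluation procedure from rq23.

```python
from typing import Dict, FrozenSet, Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]
# Assumption: None in a pattern position acts as a wildcard (a variable).
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]


def match(graph: Graph, pattern: Pattern) -> Iterator[Triple]:
    """Yield the triples of `graph` that agree with every non-None pattern term."""
    for triple in graph:
        if all(p is None or p == t for p, t in zip(pattern, triple)):
            yield triple


def query_background(background: Graph, pattern: Pattern) -> Set[Triple]:
    # Plain (s p o): only the background graph G is consulted.
    return set(match(background, pattern))


def query_graph_var(named: Dict[str, Graph], pattern: Pattern) -> Set[Tuple[str, Triple]]:
    # GRAPH ?g (s p o): try every (<Ui>, Gi) pair and report which name matched.
    return {(name, t) for name, g in named.items() for t in match(g, pattern)}


def query_graph_uri(named: Dict[str, Graph], uri: str, pattern: Pattern) -> Set[Triple]:
    # GRAPH <U> (s p o): only the graph named <U>.
    # Assumption: an unknown name simply yields no matches.
    return set(match(named.get(uri, frozenset()), pattern))


# Hypothetical data: empty background graph, two named graphs.
A = frozenset({("ex:a", "ex:p", "ex:x")})
B = frozenset({("ex:b", "ex:p", "ex:y")})
named = {"u1": A, "u2": B}
any_p = (None, "ex:p", None)

plain = query_background(frozenset(), any_p)
by_var = query_graph_var(named, any_p)
by_uri = query_graph_uri(named, "u1", any_p)
```

Here the plain pattern finds nothing because the background graph is empty, while the two GRAPH forms reach the named graphs.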
>>
>>
>> There is a section after that to define how the query itself can 
>> describe the dataset.  That section is what I think we are mainly 
>> discussing.  It uses the keywords from the F2F but I regard it as 
>> unfinished - I will not cover this in this message in order to 
>> concentrate on the idea of "RDF dataset".
>>
>> In the idea of "RDF dataset", there is no assumption about the setup 
>> of the dataset.
>>
>>   G may include the union of the Gi.
>>   G may be disjoint from any Gi or subset of Gi.
>>
>>
>> Both are allowed - the definition of RDF dataset does not imply one 
>> way over another; it is more general, and we may choose to 
>> restrict it.  The concept "RDF dataset" does not say how the dataset is 
>> built. Therefore, it should be able to express all the various 
>> different use cases we have.
>>
>> I have avoided use of the terms "trusted" and "untrusted" to 
>> concentrate on naming.  As SteveH points out, an approach to make 
>> these orthogonal is to have a flag associated with graphs.
>>
>> Examples of RDF datasets:
>>
>> Suppose we have graphs A and B.
>>
>> These are some of the different datasets that can be made up 
>> from A and B.
>>
>> -- Dataset example 1:
>>
>> A single background graph is the RDF merge of A and B.  (I use "RDF 
>> merge" to be clear that the background graph is accessed as an RDF 
>> graph.  It also makes it clear what happens about bNodes.)
>>
>> The RDF dataset is  { merge[A,B] } with no pairs of name/graph.
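
To illustrate why "RDF merge" rather than a plain set union matters for bNodes, here is a rough sketch. The renaming scheme (appending a per-graph suffix to labels starting with "_:") and the function names are assumptions chosen for the example, not a prescribed algorithm.

```python
from typing import FrozenSet, Tuple

Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]


def rename_bnodes(graph: Graph, tag: str) -> Graph:
    """Suffix blank node labels (terms starting with "_:") in subject and object position."""
    def rename(term: str) -> str:
        return f"{term}_{tag}" if term.startswith("_:") else term
    return frozenset((rename(s), p, rename(o)) for s, p, o in graph)


def rdf_merge(*graphs: Graph) -> Graph:
    """Union of the graphs after giving each input graph's blank nodes distinct labels."""
    merged = set()
    for index, graph in enumerate(graphs):
        merged |= rename_bnodes(graph, f"g{index}")
    return frozenset(merged)


# Hypothetical graphs that happen to reuse the same blank node label.
A = frozenset({("_:x", "ex:name", '"Alice"')})
B = frozenset({("_:x", "ex:name", '"Bob"')})

merged = rdf_merge(A, B)
plain_union = A | B
```

Both results contain two triples, but in plain_union the label "_:x" is shared, so the two names appear to describe one node; in merged the two blank nodes stay distinct.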
>>
>> -- Dataset example 2:
>>
>> A background graph that is the merge of A and B, together with access 
>> to the graphs by names URI <u1> and URI <u2>.  An implementation could 
>> avoid copies but exactly how will depend on the implementation.
>>
>> The RDF dataset is  { merge[A,B], (<u1>, A), (<u2>, B) }
>>
>> This is the example of (s p o) accessing all triples, and further 
>> having access to the individual named graphs.
>>
>> -- Dataset example 3:
>>
>> The background graph has some provenance information, graph P, about A 
>> and B.
>> This is the example in rq23 "GRAPH and a single, unnamed graph":
>>
>>   http://www.w3.org/2001/sw/DataAccess/rq23/#sourcePlainGraph
>>
>> where A and B are the same URL read at different times, obtaining 
>> different graphs.
>>
>> The RDF dataset is  { P, (<u1>, A), (<u2>, B) }
>>
>> and <u1>, <u2> are internal names : P records the mapping of internal 
>> names to original location (the same location in the example).
>>
>> -- Dataset example 4:
>>
>> The background graph is empty. A plain "(s p o)" does not match - the 
>> background graph has no triples.
>>
>> The RDF dataset is  { empty, (<u1>, A), (<u2>, B) }.
>>
>> or omit the background graph  { (<u1>, A), (<u2>, B) } -- "(s p o)" 
>> does not match in either case.
>>
>> -- Dataset example 5:
>>
>> The background graph contains A, but not B.  Both A and B are 
>> available as named graphs.  The RDF dataset is { A, (<u1>, A), (<u2>, 
>> B) }.
>>
>> -- Dataset example 6:
>>
>> The graph A is known by <u1> and also by <u3> and is in the background 
>> graph. B is not in the dataset.
>>
>> The RDF dataset is { A, (<u1>, A), (<u3>, A) }.
>>
>> -- Dataset example 7:
>>
>> If we have a third graph C, then we can have the RDF merge of A and C, 
>> give it a name <u4> and still retain access to just A:
>>
>> { empty, (<u4>, merge[A, C]), (<u1>, A), (<u2>, B) }
>>
>>
>> End of examples.
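
Some of the example datasets above can be written out concretely, as a sketch. The graphs A and B below are made-up ground graphs (no bNodes), so a set union stands in for RDF merge; the pair encoding of a dataset and the helper name are assumptions for illustration only.

```python
from typing import Dict, FrozenSet, Tuple

Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]
Dataset = Tuple[Graph, Dict[str, Graph]]  # (background graph, named graphs)

# Hypothetical ground graphs: with no blank nodes, union equals RDF merge.
A: Graph = frozenset({("ex:a", "ex:p", "ex:x")})
B: Graph = frozenset({("ex:b", "ex:p", "ex:y")})
EMPTY: Graph = frozenset()

example1: Dataset = (A | B, {})                    # { merge[A,B] }
example2: Dataset = (A | B, {"u1": A, "u2": B})    # { merge[A,B], (<u1>, A), (<u2>, B) }
example4: Dataset = (EMPTY, {"u1": A, "u2": B})    # { empty, (<u1>, A), (<u2>, B) }
example5: Dataset = (A, {"u1": A, "u2": B})        # { A, (<u1>, A), (<u2>, B) }
example6: Dataset = (A, {"u1": A, "u3": A})        # { A, (<u1>, A), (<u3>, A) }


def background_includes_all_named(dataset: Dataset) -> bool:
    """True when every triple of every named graph is also in the background graph."""
    background, named = dataset
    return all(graph <= background for graph in named.values())
```

The helper separates the setups: it holds for examples 1, 2 and 6, and fails for examples 4 and 5, where the background graph omits some or all of the named graphs' triples. Both kinds of dataset fit the same definition.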
>>
>>
>> There is also the matter of how, if at all, the query itself can describe 
>> a dataset.  I think this is the main area of difference and there is 
>> subtly different terminology so I have avoided that matter here and 
>> await people's responses to the idea of "RDF dataset".
>>
>> If we have a common definition of "RDF dataset", then we can write the 
>> various options out for query keywords in terms of the effect on the 
>> RDF dataset for the query.
>>
>> Does this definition of RDF dataset, and the relation to graph 
>> patterns, form a basis for defining how a query might define a dataset 
>> within the SPARQL syntax?
>>
>> Is this definition of RDF dataset missing anything?
>>
>>     Andy
>>
>>
> -
> Alberto Reggiori, @Semantics S.R.L.
> www.asemantics.com
> 
>
Received on Tuesday, 1 February 2005 21:18:00 UTC