Named Containers : a framework for aggregation and query from Seaborne, Andy on 2004-09-29 (public-rdf-dawg@w3.org from July to September 2004)

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Wed, 29 Sep 2004 12:18:23 +0100
To: <public-rdf-dawg@w3.org>
Message-ID: <8D5B24B83C6A2E4B9E7EE5FA82627DC93967F0@sdcexcea01.emea.cpqcorp.net>
Inspired by the use cases we have, this proposal is an attempt to give a
conceptual framework for aggregation and query.  Sorry it's a bit long.

--------------------------------------

Named Containers
================

This is an attempt at a conceptual model for SOURCE and FROM in SPARQL.
I came across the idea in rdf-interest in an email from Bob MacGregor.
The term "Named Containers of Triples" may predate that.

http://lists.w3.org/Archives/Public/www-rdf-interest/2004Aug/0225.html

All mistakes are mine; the fundamental idea isn't.  It does give a
simple, best shot conceptual framework, for this round of a SPARQL spec;
it gives space for implementations to do their own thing while providing
a defined space of interoperability.

"Simple" means it tries to add as little machinery over RDF as possible.
It does not cover all systems out there which have
implementation-specific techniques and expose the implementation to the
query.  I'd expect them to continue to do so.  This simple framework is
for interoperability.

==== Description:

A query executes over data associated with the query processor in some
system-dependent way.  The data is a collection of named containers.
Each container of triples is an RDF graph.  This may be the inference
closure - it is treated as a graph.

The data can be viewed and accessed in two ways:

1/ All the RDF statements in all the containers can be viewed as a
single RDF graph.  This graph need not be realised but the access to the
graph is RDF semantics - it's a set of statements formed by the RDF-merge
of the named graphs.

2/ An individual named graph can be accessed as an individual RDF graph.
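
As a rough illustration of the two views (not part of the proposal
itself), the data can be modelled as a mapping from container URI to a
set of triples.  The URIs, the triples and the function names here are
all made up for the sketch, and bNodes are ignored.

```python
# Sketch only: a "dataset" maps each container URI to a set of triples.
# Every URI and triple below is invented for illustration.
dataset = {
    "http://example.org/u1": {(":a", ":knows", ":b"), (":a", ":name", '"A"')},
    "http://example.org/u2": {(":a", ":knows", ":b"), (":b", ":name", '"B"')},
}


def aggregate(ds):
    """View 1: all statements in all containers, seen as one graph."""
    merged = set()
    for triples in ds.values():
        merged |= triples  # set union suppresses duplicate statements
    return merged


def container(ds, uri):
    """View 2: one named container, seen as an individual graph."""
    return ds[uri]
```

The aggregate never has to be stored; computing it on demand, as here,
is one way of not realising it.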


== Notes

Provenance information:

This framework says nothing about system information such as timestamps
on graphs or other provenance information.  That gets into a whole
infrastructure for a provenance base layer which is beyond DAWG
timescale.  Systems will continue to innovate in this area.

It's not possible to access, say, two out of five containers as an
RDF-merge.  It's simple to extend to that but I am worried enough about
the implementation costs of dynamic (at runtime) RDF merging to not want
it mandated.

RDF-Merge:

bNodes are made distinct by the bNode relabelling requirement.  As
bNode labels are never revealed in a query, this is the same, for
query, as assuming bNodes are all distinct.  A bNode in a named
container is different from any bNode in a different named container
unless the two containers are the same graph (that is, same graph,
different names).  A bNode in the aggregate graph is the same bNode as
in the container ("same" means "query same" i.e. same value as concerns
matching).
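
A minimal sketch of the relabelling step, assuming bNodes are written
as strings starting with "_:" and that every container is a distinct
graph (the "same graph, different names" case is not handled).  The
function names and the relabelling scheme are my own illustration.

```python
def is_bnode(term):
    """Assumption for this sketch: bNodes are strings starting with '_:'."""
    return term.startswith("_:")


def rdf_merge(ds):
    """Merge all containers, giving each container its own bNode labels."""
    merged = set()
    for index, uri in enumerate(sorted(ds)):
        def relabel(term, index=index):
            # Prefix the label with the container index so labels cannot clash.
            return "_:c%d_%s" % (index, term[2:]) if is_bnode(term) else term

        for s, p, o in ds[uri]:
            merged.add((relabel(s), p, relabel(o)))
    return merged


# Both containers happen to use the label _:x; after the merge they differ.
dataset = {
    "http://example.org/u1": {("_:x", ":name", '"A"')},
    "http://example.org/u2": {("_:x", ":name", '"B"')},
}
merged = rdf_merge(dataset)
```

Since the labels never appear in query results, any scheme that keeps
the two sets of labels apart would do equally well.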


==== No SOURCE - plain (?x ?y ?z)

    WHERE ( ?x ?y ?z ) 

Where there is no SOURCE applied to a pattern, the pattern is matched
against the aggregation graph - the RDF merge of all the named
containers.  When being SPARQL-compatible, it contains no more
statements than exactly the RDF-merge.  I'd expect many systems to
execute in a non-compatible mode that exposes the local provenance
information and other features.


==== SOURCE

The SOURCE operation allows access to a named container as an RDF graph.

    WHERE SOURCE <uri1> ( ?x ?y ?z )

is all the triples from the named container <uri1> and no more.

    WHERE SOURCE ?src ( ?x ?y ?z )

In the procedural interpretation of a query, if ?src is bound then this
executes the query pattern in the named container.  If ?src is not bound
it means execute on each container individually, with ?src bound to the
URI of the container.  What SOURCE does is restrict access to the
named container (not the overall RDF merge).
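
The procedural reading above could be sketched as follows.  This is
only an illustration: the dataset is a dict of URI to set of triples,
variables are strings starting with "?", and the names match, source
and bound are invented for the sketch.

```python
def match(pattern, triple):
    """Return variable bindings if the triple matches the pattern, else None."""
    binding = {}
    for want, got in zip(pattern, triple):
        if want.startswith("?"):
            # A variable used twice in a pattern must match the same value.
            if binding.setdefault(want, got) != got:
                return None
        elif want != got:
            return None
    return binding


def source(dataset, src, pattern, bound=None):
    """SOURCE src (pattern): match inside one container, or each in turn."""
    bound = bound or {}
    name = bound.get(src, src)  # substitute ?src if it is already bound
    if name.startswith("?"):
        names = sorted(dataset)  # unbound: try every container individually
    else:
        names = [name] if name in dataset else []
    rows = []
    for uri in names:
        for triple in sorted(dataset[uri]):
            binding = match(pattern, triple)
            if binding is not None:
                if src.startswith("?"):
                    binding[src] = uri  # ?src is bound to the container URI
                rows.append(binding)
    return rows


dataset = {
    "http://example.org/u1": {(":a", ":p", ":b")},
    "http://example.org/u2": {(":a", ":p", ":b"), (":c", ":p", ":d")},
}
all_rows = source(dataset, "?src", ("?x", ":p", "?z"))
u2_rows = source(dataset, "?src", ("?x", "?y", "?z"), {"?src": "http://example.org/u2"})
```

Note that the pattern is never matched against the merge of the
containers, only against one container at a time.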

If the triple pattern elements are RDF terms:

    SOURCE ?src ( :x :y "z" )
    
then this is asking for all named containers that have the statement
:x :y "z" - that is, testing to see where a statement can be found.
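
With the data held as a dict of container URI to set of triples (an
assumption of this sketch, as is the function name), the test reduces
to set membership per container:

```python
def containers_holding(dataset, triple):
    """Return the URIs of all containers in which the statement occurs."""
    return sorted(uri for uri, graph in dataset.items() if triple in graph)


dataset = {
    "http://example.org/u1": {(":x", ":y", '"z"')},
    "http://example.org/u2": {(":x", ":y", '"z"'), (":x", ":y", '"other"')},
    "http://example.org/u3": {(":x", ":y", '"other"')},
}
found_in = containers_holding(dataset, (":x", ":y", '"z"'))
```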

Incidentally,

    SELECT DISTINCT ?x ?y ?z WHERE SOURCE ?src (?x ?y ?z)

has the same results as

    SELECT ?x ?y ?z WHERE (?x ?y ?z)

"Union query" can be achieved with:

    SELECT ?x ?y ?z WHERE SOURCE ?src (?x ?y ?z)
  
It is the concatenation of the results from querying each graph in
turn.
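
To make the difference between the three queries concrete, here is a
small sketch for the all-variables pattern (?x ?y ?z) only.  It assumes
a dict of container URI to set of ground triples (no bNodes); the data
is invented.

```python
dataset = {
    "http://example.org/u1": {(":a", ":p", ":b"), (":a", ":q", ":c")},
    "http://example.org/u2": {(":a", ":p", ":b"), (":d", ":p", ":e")},
}

# SELECT ?x ?y ?z WHERE SOURCE ?src (?x ?y ?z)
# One row per triple per container, so shared triples appear more than once.
union_rows = [t for uri in sorted(dataset) for t in sorted(dataset[uri])]

# SELECT DISTINCT ?x ?y ?z WHERE SOURCE ?src (?x ?y ?z)
# Duplicates across containers collapse.
distinct_rows = set(union_rows)

# SELECT ?x ?y ?z WHERE (?x ?y ?z)
# Rows come from the aggregate graph, which is already a set of statements.
aggregate_rows = set().union(*dataset.values())
```

Here union_rows has four entries because (:a :p :b) is in both
containers, while distinct_rows and aggregate_rows hold the same three.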


==== FROM

This is as much about "protocol" as query but it's needed for the local
query case where there isn't a protocol layer.

FROM establishes the data for a query.  How URIs of named containers get
handled is up to the implementation but some systems will load URLs and
files, some will attach to databases and some will do nothing much
because the system environment handles getting to some collection of
named containers.  There is no requirement to load URLs across the web.
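
One way to picture this is a FROM step that takes a pluggable resolver,
so that the choice between loading a URL, reading a file or attaching
to a database stays with the implementation.  The names build_context,
local_resolver and LOCAL_STORE are hypothetical; nothing here fetches
anything over the web.

```python
from typing import Callable, Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]
Graph = Set[Triple]


def build_context(uris: Iterable[str], resolve: Callable[[str], Graph]) -> Dict[str, Graph]:
    """FROM <u1> <u2> ...: one named container per URI, obtained via resolve()."""
    return {uri: resolve(uri) for uri in uris}


# Hypothetical resolver backed by an in-memory table standing in for
# whatever the system environment provides.
LOCAL_STORE: Dict[str, Graph] = {
    "http://example.org/u1": {(":a", ":p", ":b")},
    "http://example.org/u2": {(":c", ":p", ":d")},
}


def local_resolver(uri: str) -> Graph:
    return set(LOCAL_STORE[uri])  # copy, so the context owns its graphs


context = build_context(
    ["http://example.org/u1", "http://example.org/u2"], local_resolver
)
```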

== Case 1: "FROM <u1> <u2>"

Build a data context with two named containers named <u1> and <u2>.


== Case 2: "FROM <u1>"

Build a data context with one named container.  Accessing the container
via SOURCE and accessing the aggregation sees the same RDF graph down to
the bNodes. If there is no SOURCE in the query, this is just querying
the graph identified by <u1> by however the system does it.


== Case 3: No FROM in query.

The implementation has to set the query data context.  This can be a
single graph or a collection of named containers.

If there is no name information, SOURCE ?src ( ?x ?y ?z ) can be either:

3a/ fail - ?src can't be bound

3b/ match as if it's a single graph but ?src is not bound.


Note: it's not possible to create a mix of named and unnamed containers
in the query data.  That is intentional.  Implementations may choose to
allow this but there would be no test cases.  Same goes for ?src being a
bNode and having some vocabulary to describe the container or container
graph.

I'd expect the case of no FROM, and getting the query context from
outside to be common in the local case.


== Case 4: "FROM <u> <u>" (same URI)

This highlights the case where two URIs name the same graph; in more
general cases this would have to be done outside the query language FROM
statement.

For the same URI case, this can go one of three ways:

4a/ Creates a data context with two named containers that do not share
bNodes.  It's like reading in the file twice. 

4b/ Creates a data context with two named containers that name the same
graph.  bNodes are the same. 

4c/ Make it illegal.

Because the same URI is used, it's possible to get indistinguishable
query results - that's an argument in favour of 4c.
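
The three options can be compared in a small sketch.  The loader is
hypothetical: it stands in for parsing a document, and each call
invents a fresh bNode label, as a real parse would.  The policy names
simply reuse the labels 4a, 4b and 4c from above.

```python
import itertools

_fresh = itertools.count()


def load(uri):
    """Hypothetical loader: each call parses afresh, so bNodes get new labels."""
    return {("_:b%d" % next(_fresh), ":loadedFrom", uri)}


def build_context(uris, policy):
    """Return a list of (name, graph) pairs under policy '4a', '4b' or '4c'."""
    context, loaded = [], {}
    for uri in uris:
        if uri in loaded:
            if policy == "4c":
                raise ValueError("duplicate URI in FROM: " + uri)
            # 4a reads the source again; 4b reuses the graph already loaded.
            graph = load(uri) if policy == "4a" else loaded[uri]
        else:
            graph = loaded[uri] = load(uri)
        context.append((uri, graph))
    return context


ctx_4a = build_context(["http://example.org/u", "http://example.org/u"], "4a")
ctx_4b = build_context(["http://example.org/u", "http://example.org/u"], "4b")
```

Under 4a the two containers hold different bNodes; under 4b they are
the same graph object.  In both cases the names are identical, which
is where the indistinguishable results come from.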


==== Systems

== cwm: 

In cwm "SOURCE <u> ( ?x ?y ?z )" is:

    @forAll <#X>, <#Y>, <#Z> .
    <u> log:semantics ?g .
    ?g log:includes { <#X> <#Y> <#Z> } .
  
and "SOURCE ?u ( ?x ?y ?z )" is:

    @forAll <#X>, <#Y>, <#Z> .
    ?u log:semantics ?g . ?g log:includes { <#X> <#Y> <#Z> } .

It has been arranged that in named containers ?g can't be returned.  Jos
had the form:

    <u>.log:semantics log:includes { <#Y> foaf:age <#Z> }

(@forAll declares document-wide variables.)

In the proposal here, it also isn't possible to get the effect of 

    ... bind ?pred to log:semantics ...
    ?something ?pred ?g . ?g log:includes { <#X> <#Y> <#Z> }

which enables static compilation of the query and removes the need to
have the query engine deal with certain predicates specially just to
meet the SPARQL spec.


== 3Store/RDFStore use cases

http://www.ecs.soton.ac.uk/~swh/source-tests/ has the form:
  SOURCE ?snode (?person <foaf:name> ?name),
       (?snode <dc:source> ?source)

None of the examples return ?snode.  It provides a resource for
annotations about graphs in the database.  This query extract is the
same as:

 SOURCE ?source (?person <foaf:name> ?name) 
 
in this named containers proposal, that is, named containers hides
?snode and hence there is no issue about returning it.  It wraps up the
use of "?snode" and "dc:source" into a single construct leaving open
different implementations.


In Named Containers, there is no standardised way to annotate the
containers.  It is not excluded, it's just outside the SPARQL spec.  The
?snode can be retrieved by "(?snode <dc:source> ?source)" to access
system information - the constants of such are outside this proposal -
but a normal query processor meeting the spec does not need to introduce
a predicate like dc:source specially.

I understand that 3Store has a system graph and RDFStore adds statements
to the graph as it is loaded or associated with the query - there are
different results for queries.  Both these detailed provenance solutions
are possible in this framework.


== The SWAD-e SWED system

http://www.swed.org.uk

This is a demonstrator system as part of SWAD-e.  It handles provenance
by having a separate "metagraph" for each collection of named
containers.  Unlike 3Store, the "metagraph" does not participate in
queries; instead it is explicitly accessed (and currently can't be
accessed by a SPARQL-like query).


== Kowari/TKS

The "from" keyword in Kowari allows the creation of a target graph
through the union and intersection of sets of statements.  If bNodes are
kept distinct, union is RDF-merge because Kowari works on sets: the
union will do the duplicate suppression (could someone confirm this
please?)

In addition, the "in" keyword allows a pattern to be applied to a named
graph.  It appears that the graph name can't be a variable.


==== Other issues: Implementation

The "named containers" framework allows a range of implementation
approaches, including databases and logic engines.

A database with a table of quads can implement this proposal - as can a
database that keeps each graph in a separate table.
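
As a sketch of the quad-table approach, using SQLite from Python purely
for illustration: the table layout, column names and data are my own
assumptions, and bNode relabelling is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per statement per container: (container URI, subject, predicate, object).
conn.execute("CREATE TABLE quads (src TEXT, s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO quads VALUES (?, ?, ?, ?)",
    [
        ("http://example.org/u1", ":a", ":p", ":b"),
        ("http://example.org/u1", ":a", ":q", ":c"),
        ("http://example.org/u2", ":a", ":p", ":b"),
    ],
)

# No SOURCE: the aggregate graph is the set of distinct (s, p, o) rows.
aggregate = conn.execute(
    "SELECT DISTINCT s, p, o FROM quads ORDER BY s, p, o"
).fetchall()

# SOURCE <u1> (?x ?y ?z): restrict the match to one container.
in_u1 = conn.execute(
    "SELECT s, p, o FROM quads WHERE src = ? ORDER BY s, p, o",
    ("http://example.org/u1",),
).fetchall()

# SOURCE ?src (:a :p :b): which containers hold the statement?
holders = [
    row[0]
    for row in conn.execute(
        "SELECT src FROM quads WHERE s = ? AND p = ? AND o = ? ORDER BY src",
        (":a", ":p", ":b"),
    )
]
```

The table-per-graph layout would answer the same three queries, with
the aggregate becoming a UNION over the tables.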

==== Bells and Whistles

We could have:

    SOURCE * (?x ?y ?z) 

which is the pattern applied to each container in turn.  * is just a
symbol for a variable not used elsewhere.  In SOURCE * .... SOURCE *,
the two uses of * are independent and need not match the same container
URI.

   SOURCE SYSTEM (?x ?y ?z)
   
This accesses the implementation-defined environment, including all
sorts of things like time, operating system version etc etc.  It could
also hold the metadata about the named containers.  As it is
implementation-specific, it should be separate from the graph of all
the containers (at least in SPARQL-compatible mode).
Received on Wednesday, 29 September 2004 11:18:57 UTC