- From: Seaborne, Andy <andy.seaborne@hp.com>
- Date: Wed, 29 Sep 2004 12:18:23 +0100
- To: <public-rdf-dawg@w3.org>
Inspired by the use cases we have, this proposal is an attempt to give a conceptual framework for aggregation and query. Sorry it's a bit long.

--------------------------------------

Named Containers
================

This is an attempt at a conceptual model for SOURCE and FROM in SPARQL. I came across the idea in rdf-interest in an email from Bob MacGregor. The term "Named Containers of Triples" may predate that.

http://lists.w3.org/Archives/Public/www-rdf-interest/2004Aug/0225.html

All mistakes are mine; the fundamental idea isn't.

It does give a simple, best-shot conceptual framework for this round of a SPARQL spec; it gives space for implementations to do their own thing while providing a defined space of interoperability.

"Simple" means it tries to add as little machinery over RDF as possible. It does not cover all systems out there which have implementation-specific techniques and expose the implementation to the query. I'd expect them to continue to do so. This simple framework is for interoperability.

==== Description:

A query executes over data associated with the query processor in some system-dependent way. The data is a collection of named containers. Each container of triples is an RDF graph. This may be the inference closure - it is treated as a graph.

The data can be viewed and accessed in two ways:

1/ All the RDF statements in all the containers can be viewed as a single RDF graph. This graph need not be realised, but access to it follows RDF semantics - it's a set of statements formed by the RDF-merge of the named graphs.

2/ An individual named graph can be accessed as an individual RDF graph.

== Notes

Provenance information:
This framework says nothing about system information such as timestamps on graphs or other provenance information. That gets into a whole infrastructure for a provenance base layer, which is beyond the DAWG timescale. Systems will continue to innovate in this area.
It's not possible to access, say, two out of five containers as an RDF-merge. It's simple to extend to that, but I am worried enough about the implementation costs of dynamic (at runtime) RDF merging not to want it mandated.

RDF-Merge:
bNodes are made distinct by the bNode relabelling requirement. As bNode labels are never revealed in a query, this is the same, for query, as assuming bNodes are all distinct.

A bNode in a named container is different from any bNode in a different named container, unless the two containers are the same graph (that is, same graph, different names). A bNode in the aggregate graph is the same bNode as in the container ("same" means "query same", i.e. same value as concerns matching).

==== No SOURCE - plain (?x ?y ?z)

    WHERE ( ?x ?y ?z )

Where there is no SOURCE applied to a pattern, the pattern is matched against the aggregation graph - the RDF-merge of all the named containers. In SPARQL-compatible mode, the aggregation graph contains exactly the statements of the RDF-merge and no more. I'd expect many systems to execute in a non-compatible mode that exposes the local provenance information and other features.

==== SOURCE

The SOURCE operation allows access to a named container as an RDF graph.

    WHERE SOURCE <uri1> ( ?x ?y ?z )

is all the triples from the named container <uri1> and no more.

    WHERE SOURCE ?src ( ?x ?y ?z )

In the procedural interpretation of a query, if ?src is bound then this executes the query pattern in the named container. If ?src is not bound, it means execute on each container individually, with ?src bound to the URI of the container.

What SOURCE does is restrict access to the named container (not the overall RDF-merge).

If the triple pattern elements are RDF terms:

    SOURCE ?src ( :x :y "z" )

then this is asking for all named containers that have the statement :x :y "z" - that is, testing to see where a statement can be found.
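As a rough illustration of the two access paths described above, the following Python sketch models the query data as a dictionary from container URI to a set of ground triples. Everything in it is an assumption made for illustration: the function names `match_aggregate` and `match_source`, the `urn:ex:` container names, and the decision to leave bNodes out entirely, so that the RDF-merge reduces to a plain set union. It is not taken from any implementation.

```python
# Sketch only: containers hold ground triples (no bNodes), so the
# RDF-merge of all containers is just the set union of their triples.
# Variables are strings starting with "?"; anything else is a constant.

def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def match_triple(pattern, triple):
    """Return a binding dict if `triple` matches `pattern`, else None."""
    binding = {}
    for p, t in zip(pattern, triple):
        if is_var(p):
            if binding.setdefault(p, t) != t:
                return None  # same variable would need two different values
        elif p != t:
            return None
    return binding

def match_aggregate(data, pattern):
    """No SOURCE: match against the merge of all named containers."""
    merged = set().union(*data.values())
    return [b for t in sorted(merged)
            if (b := match_triple(pattern, t)) is not None]

def match_source(data, src, pattern):
    """SOURCE src pattern: match inside one container at a time.

    If `src` is a variable, each container is tried individually and
    `src` is bound to that container's URI in every solution.
    """
    names = sorted(data) if is_var(src) else [src]
    results = []
    for name in names:
        for t in sorted(data.get(name, ())):
            b = match_triple(pattern, t)
            if b is not None:
                if is_var(src):
                    b = {src: name, **b}
                results.append(b)
    return results

# Hypothetical data: two named containers sharing one statement.
data = {
    "urn:ex:g1": {(":x", ":y", "z"), (":a", ":b", ":c")},
    "urn:ex:g2": {(":x", ":y", "z")},
}

# Which containers hold the statement :x :y "z"?
where_found = match_source(data, "?src", (":x", ":y", "z"))
```

In this sketch the shared statement appears once in the aggregate view but is reported once per container when matched through a SOURCE variable.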
Incidentally,

    SELECT DISTINCT ?x ?y ?z
    WHERE SOURCE ?src (?x ?y ?z)

has the same results as

    SELECT ?x ?y ?z
    WHERE (?x ?y ?z)

"Union query" can be achieved with:

    SELECT ?x ?y ?z
    WHERE SOURCE ?src (?x ?y ?z)

It is the concatenation of the results from querying each graph in turn.

==== FROM

This is as much about "protocol" as query, but it's needed for the local query case where there isn't a protocol layer.

FROM establishes the data for a query. How URIs of named containers get handled is up to the implementation: some systems will load URLs and files, some will attach to databases, and some will do nothing much because the system environment handles getting to some collection of named containers. There is no requirement to load URLs across the web.

== Case 1: "FROM <u1> <u2>"

Build a data context with two named containers named <u1> and <u2>.

== Case 2: "FROM <u1>"

Build a data context with one named container. Accessing the container via SOURCE and accessing the aggregation sees the same RDF graph, down to the bNodes.

If there is no SOURCE in the query, this is just querying the graph identified by <u1>, however the system does it.

== Case 3: No FROM in query.

The implementation has to set the query data context. This can be a single graph or a collection of named containers. If there is no name information,

    SOURCE ?src ( ?x ?y ?z )

can be either:

3a/ fail - ?src can't be bound
3b/ match as if it's a single graph, but ?src is not bound.

Note: it's not possible to create a mix of named and unnamed containers in the query data. That is intentional. Implementations may choose to allow this but there would be no test cases. Same goes for ?src being a bNode and having some vocabulary to describe the container or container graph.

I'd expect the case of no FROM, and getting the query context from outside, to be common in the local case.
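The two alternatives in Case 3 can be sketched in Python as follows. This is only an illustration under stated assumptions: the unnamed query data is modelled as a single set of ground triples (no bNodes), and the function name `source_on_unnamed` and its `option` argument are hypothetical, introduced here just to switch between 3a and 3b.

```python
# Sketch only: what SOURCE ?src (pattern) might do when the query data
# carries no container names. Variables are strings starting with "?".

def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def match_triple(pattern, triple):
    """Return a binding dict if `triple` matches `pattern`, else None."""
    binding = {}
    for p, t in zip(pattern, triple):
        if is_var(p):
            if binding.setdefault(p, t) != t:
                return None  # same variable would need two different values
        elif p != t:
            return None
    return binding

def source_on_unnamed(graph, src_var, pattern, option):
    """Apply SOURCE src_var pattern to a single unnamed graph."""
    if option == "3a":
        # ?src can never be bound, so the pattern has no solutions.
        return []
    if option == "3b":
        # Match as a single graph; src_var stays unbound, so it is
        # simply absent from every solution.
        return [b for t in sorted(graph)
                if (b := match_triple(pattern, t)) is not None]
    raise ValueError("option must be '3a' or '3b'")

# Hypothetical unnamed data.
graph = {(":x", ":y", "z")}

fails = source_on_unnamed(graph, "?src", ("?x", "?y", "?z"), "3a")
matches = source_on_unnamed(graph, "?src", ("?x", "?y", "?z"), "3b")
```

Under 3b a client sees the same solutions as the plain pattern would give, just without a value for the SOURCE variable; under 3a the same query returns nothing.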
== Case 4: "FROM <u> <u>" (same URI)

This highlights the case where two URIs name the same graph; in more general cases this would have to be done outside the query language FROM statement.

For the same URI case, this can go one of three ways:

4a/ Create a data context with two named containers that do not share bNodes. It's like reading in the file twice.

4b/ Create a data context with two named containers that name the same graph. bNodes are the same.

4c/ Make it illegal.

Because the same URI is used, it's possible to get indistinguishable query results - that's an argument in favour of 4c.

==== Systems

== cwm:

In cwm "SOURCE <u> ( ?x ?y ?z )" is:

    @forAll <#X>, <#Y>, <#Z>
    <u> log:semantics ?g .
    ?g log:includes { <#X>, <#Y>, <#Z> } .

and "SOURCE ?u ( ?x ?y ?z )"

    @forAll <#X>, <#Y>, <#Z>
    ?u log:semantics ?g .
    ?g log:includes { <#X>, <#Y>, <#Z> }

It has been arranged that in named containers ?g can't be returned.

Jos had the form:

    <u>.log:semantics log:includes { <#Y> foaf:age <#Z>}}

(@forAll declares document-wide variables.)

In the proposal here, it also isn't possible to get the effect of

    ... bind ?pred to log:semantics ...
    ?something ?pred ?g .
    ?g log:includes { <#X>, <#Y>, <#Z> }

which enables static compilation of the query and removes the need to have the query engine deal with certain predicates specially just to meet the SPARQL spec.

== 3Store/RDFStore use cases

http://www.ecs.soton.ac.uk/~swh/source-tests/

has the form:

    SOURCE ?snode (?person <foaf:name> ?name), (?snode <dc:source> ?source)

None of the examples return ?snode. It provides a resource for annotations about graphs in the database.

This query extract is the same as:

    SOURCE ?source (?person <foaf:name> ?name)

in this named containers proposal; that is, named containers hides ?snode and hence there is no issue about returning it. It wraps up the use of "?snode" and "dc:source" into a single construct, leaving open different implementations.
In Named Containers, there is no standardised way to annotate the containers. It is not excluded; it's just outside the SPARQL spec. The ?snode can be retrieved by "(?snode <dc:source> ?source)" to access system information - the constants of such are outside this proposal - but a normal query processor meeting the spec does not need to introduce a predicate like dc:source specially.

I understand that 3Store has a system graph and RDFStore adds statements to the graph as it is loaded or associated with the query - there are different results for queries. Both these detailed provenance solutions are possible in this framework.

== The SWAD-e SWED system

http://www.swed.org.uk

This is a demonstrator system built as part of SWAD-e. It handles provenance by having a separate "metagraph" for each collection of named containers. Unlike 3Store, the "metagraph" does not participate in queries; instead it is explicitly accessed (and currently can't be accessed by a SPARQL-like query).

== Kowari/TKS

The "from" keyword in Kowari allows the creation of a target graph through the union and intersection of sets of statements. If bNodes are kept distinct, union is RDF-merge because Kowari works on sets: the union will do the duplicate suppression (could someone confirm this please?)

In addition, the "in" keyword allows a pattern to be applied to a named graph. It appears that the graph name can't be a variable.

==== Other issues: Implementation

The "named containers" framework allows a range of implementation approaches, including databases and logic engines. A database with a table of quads can implement this proposal - as can a database that keeps each graph in a separate table.

==== Bells and Whistles

We could have:

    SOURCE * (?x ?y ?z)

which is the pattern applied to each container in turn. * is just a symbol for a variable not used elsewhere.

    SOURCE * .... SOURCE *

does not match the container URIs.
    SOURCE SYSTEM (?x ?y ?z)

Access the implementation-defined environment, including all sorts of things like time, operating system version, etc. It could also hold the metadata about the named containers. As it is implementation-specific, it should be separate from the graph of all the containers (at least in SPARQL-compatible mode).
Received on Wednesday, 29 September 2004 11:18:57 UTC