Re: Named Containers : a framework for aggregation and query from Alberto Reggiori on 2004-10-05 (public-rdf-dawg@w3.org from October to December 2004)

From: Alberto Reggiori <alberto@asemantics.com>
Date: Tue, 5 Oct 2004 12:08:12 +0200
To: Andy Seaborne <andy.seaborne@hp.com>
Cc: DAWG public list <public-rdf-dawg@w3.org>
Message-Id: <7770F28E-16B6-11D9-B5B4-0003939CA324@asemantics.com>

Andy, thanks a lot for this effort to put some order into the 
SOURCE story / mess :)

I like your proposal in general, and its "openness": it does not try to 
mandate "the" solution but to provide "a" possible solution. I still 
feel people here have different, even if very close, views of the same 
thing, and we need some more discussion on the list...

my comments inline below...

(some of these questions you might have already answered in other 
emails - but I would like to stimulate discussion inside the WG about 
the why/what/when, and try to find some best practices for dealing with 
those SOURCEs in an interoperable way)

On Sep 29, 2004, at 1:18 PM, Seaborne, Andy wrote:

>
> Inspired by the use cases we have, this proposal is an attempt to give 
> a
> conceptual framework for aggregation and query.  Sorry it's a bit long.
>
> --------------------------------------
>
> ==== Description:
>
> A query executes over data associated with the query processor in some
> system-dependent way.  The data is a collection of named containers.
> Each container of triples is an RDF graph.

a named container is a resource I guess - named by a URI, or can it also be a bNode?

> This may be the inference
> closure - it is treated as a graph.
>
> The data can be viewed and accessed in two ways:
>
> 1/ All the RDF statements in all the containers

...all the statements of the collection (of named containers)

> can be viewed as a
> single RDF graph.  This graph need not be realised but the access to 
> the
> graph is RDF semantics - its a set of statements formed by the 
> RDF-merge
> of the named graphs.

...of the named containers (graphs).

would it make sense for this virtual graph (collection of named 
containers) to have an identifier as well?

E.g.

in the syntax FROM <urn:dawg:foo.com:big-collection-of-named-graphs>  
or FROM <rdfsource://foo.com/big/collection/of/named-graphs>

or in the protocol as you like GET 
/model?from=urn:dawg:foo.com:big-collection-of-named-graphs

>
> 2/ An individual named graph can be accessed as an individual RDF 
> graph.

how? explicitly by URI or triple-pattern (if bNode)?

>
>
> == Notes
>
> Provenance information:
>
> This framework says nothing about system information such as timestamps
> on graphs or other provenance information.  That get into a whole
> infrastructure for a provenance base layer which is beyond DAWG
> timescale.  Systems will continue to innovate in this area.

would such provenance information be accessible at the query level? - 
i.e. could it be used in triple patterns?

your SOURCE SYSTEM proposal below (if I understood it correctly) seems 
along those lines...

in practice, for our test cases for example, where would such 
provenance information be stored? in the manifest?

For some other SOURCE tests I am working on, I am trying to put such 
provenance information into the manifest itself, referring directly or 
indirectly to the URI of the qt:data. If there is any other triple 
having the qt:data URI as object and dc:source (or log:semantics) as 
predicate, then I will express the named container (named graph) as a 
bNode and refer to it by description in the query. Each triple from 
qt:data will then get that bNode as SOURCE. Just an idea....

But again, we do not have best practices for how to express / mark up 
such named graphs and provenance information (TriX is just a proposal), 
and it might be harder to handle the general case. We have some 
extensions (again implementation specific) to include more than one 
graph in the same file/URI - but how are other people here doing it? at 
the API / protocol level?

Last time I talked with Kendall about Rdflib, it seemed they are doing 
this programmatically with API / hacks - how are others dealing with 
this?

The CWM case seems clearer due to the use of formulae and log:semantics

>
> It's not possible to access, say two out of five containers as a
> RDF-merge.  Its simple to extend to that but I am worried enough about
> the implementation costs of dynamic (at runtime) RDF merging to not 
> want
> it mandated.

well, this is a design/implementation issue I guess - using quads this 
can be achieved quite cheaply, and a full scan of each container is not 
needed on merge. Just "or" one or more sources into your query and the 
job is kind of easy.
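
To make the "or-ing" idea concrete, here is a minimal sketch in Python. 
It is a toy, not how any of the stores mentioned here actually work: the 
tuple layout (s, p, o, source), the match() helper and the urn:c1..urn:c5 
container names are all made up for illustration, only ground terms are 
used (no bNode relabelling), and it scans a plain list where a real quad 
store would use an index on the source column.

```python
# Toy quads: (subject, predicate, object, source). All names are hypothetical.
QUADS = [
    ("ex:a", "ex:p", "1", "urn:c1"),
    ("ex:a", "ex:p", "1", "urn:c2"),
    ("ex:b", "ex:p", "2", "urn:c3"),
    ("ex:c", "ex:q", "3", "urn:c4"),
    ("ex:d", "ex:q", "4", "urn:c5"),
]


def match(quads, pattern, sources=None):
    """Return the set of triples matching `pattern` (None is a wildcard).

    `sources` is an optional set of container names. A quad qualifies if its
    source is any one of them, i.e. the sources are "or-ed" together.
    None means "all containers".
    """
    hits = set()
    for s, p, o, src in quads:
        if sources is not None and src not in sources:
            continue
        if all(want is None or want == got
               for want, got in zip(pattern, (s, p, o))):
            # Collecting into a set drops duplicate copies of a triple,
            # which is the duplicate suppression a merge needs for ground terms.
            hits.add((s, p, o))
    return hits


# Merge of two containers out of five, computed at query time.
merged_two = match(QUADS, (None, "ex:p", None), sources={"urn:c1", "urn:c3"})
print(sorted(merged_two))
```

The point is only that selecting a subset of containers is a filter on 
the fourth column, so no merged graph has to be built up front.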

>
> RDF-Merge:
>
> bNodes are made distinct by the bNodes relabelling requirement. As the
> bNodes labels are never revealed in a query, this is the same, for
> query, as assuming bNodes are all distinct.  A bNode in a named
> container is different from any bNode in a different named container if
> it is not the same graph (that is, same graph, different names).  A
> bNode in the aggregate graph is the same bNode as in the container
> ("same" means "query same" i.e. same value as concerns matching).

ok makes sense

>
>
> ==== No SOURCE - plain (?x ?y ?z)
>
>     WHERE ( ?x ?y ?z )
>
> Where there is no SOURCE applied to a pattern, the pattern is matched
> against the aggregation graph - the RDF merge of all the named
> containers.  When being SPARQL-compatible, it contains no more
> statements than exactly the RDF-merge.  I'd expect many systems to
> execute in a non-compatible mode that exposes the local provenance
> information and other features.

many systems (quad-based ones) will need to use SELECT DISTINCT to get 
the same effect - IMO quads are useful because one wants/needs to 
distinguish between different "copies" of the "same" triple :) and with 
the RDF merge this benefit would be lost almost completely if 
implemented as specified in the RDF MT doc.

>
> ==== SOURCE
>
> The SOURCE operation allows access to a named container as an RDF 
> graph.
>
>     WHERE SOURCE <uri1> ( ?x ?y ?z )
>
> is all the triples from the named container <uri1> and no more.

ok URI SOURCE case

>
>     WHERE SOURCE ?src ( ?x ?y ?z )
>
> In the procedural interpretation of a query, if ?src is bound then this
> execute the query pattern in the named container.  If ?src is not bound
> it means execute on each container individually, with ?src bound to the
> URI of the container.

how do you bind ?src then? in the protocol? or using some tricky FROM 
(?src dawg:source <uri>) (or something like log:semantics?)

> What SOURCE does is restrict to access to the
> named container (not the overall RDF merge).

sure, that's the idea - filter triples down to a certain space...

>
> If the triple pattern elements are RDF terms:
>
>     SOURCE ?src ( :x :y "z" )
>
> then this is asking for all named containers that have the statement
> :x :y "z" - that is, testing to see where a statement can be found.

is this query also meant to bind the ?src variable?

>
> Incidently,
>
>     SELECT DISTINCT ?x ?y ?z WHERE SOURCE ?src (?x ?y ?z)
>
> has the same results as
>
>     SELECT ?x ?y ?z WHERE (?x ?y ?z)

this would require a default SELECT DISTINCT anyway for quads...
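
A small sketch of why, again in Python and again only an illustration 
under my own assumptions (hypothetical container names, ground terms 
only, rows as plain tuples): over quads, a pattern with a SOURCE variable 
produces one row per copy of a triple, and only after DISTINCT do the 
rows coincide with the set of triples in the merge.

```python
# The same triple is asserted in two hypothetical containers.
QUAD_ROWS = [
    ("ex:a", "ex:p", "1", "urn:c1"),
    ("ex:a", "ex:p", "1", "urn:c2"),
    ("ex:b", "ex:p", "2", "urn:c2"),
]


def select_with_source_var(quads):
    """Rows for: SELECT ?x ?y ?z WHERE SOURCE ?src (?x ?y ?z).

    One row per quad, so a triple held in two containers appears twice.
    """
    return [(s, p, o) for s, p, o, _src in quads]


def distinct(rows):
    """SELECT DISTINCT: drop repeated rows, keeping first-seen order."""
    return list(dict.fromkeys(rows))


def merge_triples(quads):
    """The merge of all containers as a set of triples (ground terms only)."""
    return {(s, p, o) for s, p, o, _src in quads}


rows = select_with_source_var(QUAD_ROWS)
print(len(rows), len(distinct(rows)), len(merge_triples(QUAD_ROWS)))
```

So a quad-based system has to do the duplicate suppression somewhere, 
either by default or because the user asks for DISTINCT.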

>
> "Union query" can be achieved with:
>
>     SELECT ?x ?y ?z WHERE SOURCE ?src (?x ?y ?z)
>
> Its is the concatenation of the results from querying each graph in
> turn.

.....is the ?src var bound?

>
>
> ==== FROM
>
> This is as much about "protocol" as query but its needed for the local
> query case where there isn't a protocol layer.

what is a local query? file:/// rdfstore:/// or even 
data:application/rdf+xml;utf-8.....

>
> FROM establishes the data for a query.

data - which data? the collection of named containers? or one of those 
named containers? the RDF-Merge of them?

here I mean we might need a way to explicitly bind ?src (SOURCE) vars 
and/or relate them to the FROM clause part (CWM gives some hints on 
this as far as I understand it....)

>  How URIs of named containers get
> handled is up to the implementation but some systems will load URLs and
> files, some will attach to databases and some will do nothing much
> because the system environment handles getting to some collection of
> named containers.  There is no requirement to load URLs across the web.

but again I am still confused, what is a "local query" ?

A local database might as well be federated and distributed....

>
> == Case 1: "FROM <u1> <u2>"
>
> Build a data context with two named containers named <u1> and <u2>.

a collection of two named containers in your definition...

>
>
> == Case 2: "FROM <u1>"
>
> Build a data context with one named containers.

the following applies to Case 1 and Case 2 I guess...

> Accessing the container

container(s)

> via SOURCE and accessing the aggregation sees the same RDF graph down 
> to
> the bNodes. If there is

what do you mean "sees the same RDF graph" ?

>  no SOURCE in the query, this is just querying
> the graph identified by <u1> by however the system does it.

I do not understand this?

>
> == Case 3: No FROM in query.
>
> The implementation has to set the query data context.  This can be a
> single graph

named container...

> or a collection of named containers.
>
> If there is no name information, SOURCE ?src ( ?x ?y ?z ) can be 
> either:
>
> 3a/ fail - ?src can't be bound
>
> 3b/ match as if its a single graph but ?src is not bound.

+1

?src = NULL / undef
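
Roughly what I have in mind, as a Python sketch. Everything here is an 
assumption for illustration: containers are a dict from name to a set of 
triples, an unnamed container uses None as its key, and the `strict` 
flag is just my way of switching between 3a and 3b.

```python
def eval_source_var(containers, pattern, strict=False):
    """Evaluate SOURCE ?src (pattern) over {name_or_None: set_of_triples}.

    Returns a list of binding dicts. With strict=False (option 3b) a match
    in an unnamed container is returned with "src" left as None (unbound).
    With strict=True (option 3a) such matches fail and are dropped.
    """
    rows = []
    for name, triples in containers.items():
        if name is None and strict:
            continue  # 3a: ?src cannot be bound, so the pattern fails
        for triple in sorted(triples):
            if all(want is None or want == got
                   for want, got in zip(pattern, triple)):
                rows.append({"src": name, "triple": triple})
    return rows


# No FROM and no name information: a single unnamed graph.
unnamed = {None: {("ex:a", "ex:p", "1")}}

rows_3b = eval_source_var(unnamed, (None, "ex:p", None))
rows_3a = eval_source_var(unnamed, (None, "ex:p", None), strict=True)
print(rows_3b, rows_3a)
```

With 3b the triple is still found and the application can see that the 
source is simply unknown, which seems more useful than an empty result.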

>
>
> Note: its not possible to create a mix of named and unnamed containers
> in the query data.

what is an unnamed container?

> That is intentional.  Implementations may choose to
> allow this but there would be no test cases.  Same goes for ?src being 
> a
> bNode and having some vocabulary to describe the container or container
> graph.

ok good - you are trying to motivate the need for some implementation 
specific stuff for this..

>
> I'd expect the case of no FROM, and getting the query context from
> outside to be common in the local case.

well, this is the other thread I guess "do we need FROM or not ?"

>
>
> == Case 4: "FROM <u> <u>" (same URI)
>
> This highlights the case where two URIs name the same graph; in more
> general cases this would have to be done outside the query language 
> FROM
> statement.
>
> For the same URI case, this is can go one of two ways:
>
> 4a/ Creates a data context with two named containers that do not share
> bNodes.  It's like reading in the file twice.

+1

>
> 4b/ Creates a data context with two named containers that name the same
> graph.  bNodes are the same.

do you mean ignore one of the two?

>
> 4c/ Make it illegal.
>
> Because the same URI is used, its possible to get indistinguishable
> query results - that's an argument in favour of 4c.
>
>
> ==== Systems
>
> == cwm:
>
> In cwm "SOURCE <u> ( ?x ?y ?z )" is:
>
>     @forAll <#X>, <#Y>, <#Z>
>     <u> log:semantics ?g .
>     ?g log:includes { <#X>, <#Y>, <#Z> } .
>
> and "SOURCE ?u ( ?x ?y ?z )"
>
>     @forAll <#X>, <#Y>, <#Z>
>     ?u log:semantics ?g . ?g log:includes { <#X>, <#Y>, <#Z> }
>
> It has been arranged that in named containers ?g can't be returned.

why can it not be returned? what's bad about it?

CWM seems to have an elegant (to me at least) way to bind ?src variables 
via log:semantics - why couldn't we use something similar (dawg:source)?

>
> == 3Store/RDFStore use cases
>
> http://www.ecs.soton.ac.uk/~swh/source-tests/ has the form:
>   SOURCE ?snode (?person <foaf:name> ?name),
>        (?snode <dc:source> ?source)
>
> None of the examples return ?snode.

that was just a choice in the specific test query - ?snode (if bNodes 
are allowed) can be returned as well in the SELECT vars part

>  It provides a resource for
> annotations about graphs in the database.  This query extract is the
> same as:
>
>  SOURCE ?source (?person <foaf:name> ?name)
>
> in this named containers proposal, that is, named containers hides
> ?snode and hence there is no issue about returning it.  It wraps up the
> use of "?snode" and "dc:source" into a single construct leaving open
> different implementations.

yes, but then you cannot constrain ?snode

or at least it is not clear to me how I could do that, unless the 
SOURCE SYSTEM construct below is meant for that...

>
>
> In Named Containers, there is no standardised way to annotate the
> containers.

annotate the container? you mean provenance information then?? :)

>  It is not excluded, its just outside the SPARQL spec.  The
> ?snode can be retrieved by "(?snode <dc:source> ?source)" to access
> system information - the constants of such are outside this proposal -
> but a normal query processor meeting the spec does not need to 
> introduce
> a predicate like dc:source specially.

so the same applies to log:semantics then?

>
> I understand that 3Store has a system graph and RDFStore adds 
> statements
> to the graph as it is loaded or associated with the query - there are
> different results for queries.

After some first quick tests, it seems 3Store and RDFStore are 
interoperable on the SOURCE issue, as far as I can tell - the problem I 
see is deciding how to bind those ?src vars and/or how to express them 
in the data and/or manifest file

>   Both these detailed provenance solutions
> are possible in this framework.

hope so :)

> == Kowari/TKS
>
> The "from" keyword in Kowari allows the creation of a target graph
> through the union and intersection of sets of statements.  If bNodes 
> are
> kept distinct, union is RDF-merge because Kowari works on sets: the
> union will do the duplicate suppression (could someone confirm this
> please?)
>
> In addition, the "in" keyword allows a pattern to be applied to a named
> graph.  It appears that the graph name can't be a variable.

graph names are URIs for them then?

> ==== Bells and Whistles
>
> We could have:
>
>     SOURCE * (?x ?y ?z)
>
> which is the pattern applied to each container in turn. * is a just
> symbol for a variable not used elsewhere. SOURCE * .... SOURCE * does
> not match the container URIs.
>
>    SOURCE SYSTEM (?x ?y ?z)
>
> Access the implementation defined environment, including all sorts of
> things like time, operating system version etc etc.

is this the trick we (me/SteveH/CWM people) could use to explicitly 
bind ?src vars to the URI of the SOURCE?

>  Could also hold the
> metadata about the named containers.  As it is implementation-specific,
> it should be separate from the graph of all the containers (at least in
> SPARQL-compatibly mode).

do you mean things like this?

SOURCE SYSTEM (?snode dawg:source ?source)

or for cwm

SOURCE SYSTEM (?snode log:semantics ?source)


anyway, good proposal - let's hope now we can get some more detailed 
discussion going on in the WG about this important issue

thanks again Andy

Yours

Alberto
Received on Tuesday, 5 October 2004 10:08:13 UTC