Re: Named Containers : a framework for aggregation and query from Seaborne, Andy on 2004-10-12 (public-rdf-dawg@w3.org from October to December 2004)

From: Seaborne, Andy <andy.seaborne@hp.com>
Date: Tue, 12 Oct 2004 17:21:14 +0100
To: Alberto Reggiori <alberto@asemantics.com>
CC: DAWG public list <public-rdf-dawg@w3.org>
Message-ID: <416C047A.3050309@hp.com>
Some comments inline but first I'd like to reply on some overall issues:

1/ Provenance

I don't think in terms of "doing provenance".  Better is to enable it (strictly, 
enable some cases of provenance).  So, I propose there is no vocabulary for 
provenance nor any approach, implied or otherwise.

2/ Features

There are several suggestions for features.  I'd like to go for a minimal 
approach with systems doing their own thing around a common core.  I'd like to 
throw features out and make the proposal smaller.

Every feature is restricting implementation freedom in an area where there is no 
community agreement as to what the right approach is.

Every feature needs time in the WG.  A smaller proposal will get us to rec. 
faster.  As this area is all-new, it will need work to define what "it" is.  A 
valid argument is to go to rec without it and delay the issue to DAWG-2.  A long 
drawn-out (i.e. complicated) system will not serve the community IMHO.

Just because the proposal says nothing about a feature does not mean it is 
illegal.  It is just not in the core that a client can rely on being in every 
implementation.

3/ bNodes

If you have bNodes that are in 1-1 correspondence with graphs loaded, you might 
as well give them URIs (URNs for example).  Every time you load a graph, give 
that particular instance a URN.  Load twice, different URNs.  Then matters of 
which graph the bNode is in go away, and you can return that information across 
the network and use the URI in a later query.
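
A minimal sketch of the "load twice, different URNs" idea, in Python.  The 
in-memory dict, the load_graph helper and the choice of urn:uuid names are all 
assumptions made for illustration; the proposal does not prescribe any of them.

```python
import uuid

def load_graph(store, triples):
    """Store one loaded instance of a graph under a freshly minted URN."""
    name = uuid.uuid4().urn  # "urn:uuid:..." (hypothetical naming choice)
    store[name] = set(triples)
    return name

store = {}
data = [("_:b0", "http://example.org/data/b", "http://example.org/data/c")]

first = load_graph(store, data)
second = load_graph(store, data)  # same data, loaded again: a new name
```

Because each loaded instance has its own URI, a client can be handed that URI 
in one result and use it in a later query, which a bNode would not allow.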


Alberto Reggiori wrote:
> 
> Andy thanks a lot for this effort trying to put some order into the 
> SOURCE story / mess :)
> 
> I like your  proposal in general and its "openness" trying not to 
> mandate "the" solution but trying to provide "a" possible solution - I 
> still feel people here having different, even if very close views, of 
> the same thing and we need some more discussion on the list...
> 
> my comments inline below...
> 
> (some questions you might have already answered in other emails - but I 
> would like to stimulate the discussion inside the WG why/what/when and 
> try to find some best practices how to deal with those SOURCEs in an 
> interoperable way)
> 
> On Sep 29, 2004, at 1:18 PM, Seaborne, Andy wrote:
> 
> 
>>Inspired by the use cases we have, this proposal is an attempt to give a
>>conceptual framework for aggregation and query.  Sorry it's a bit long.
>>
>>--------------------------------------
>>
>>==== Description:
>>
>>A query executes over data associated with the query processor in some
>>system-dependent way.  The data is a collection of named containers.
>>Each container of triples is an RDF graph.
> 
> 
> a named container is a resource I guess - a URI or can also be a bNode?

A named container is a graph.  What the collection is is (carefully!) undefined.

> 
> 
>>This may be the inference
>>closure - it is treated as a graph.
>>
>>The data can be viewed and accessed in two ways:
>>
>>1/ All the RDF statements in all the containers
> 
> 
> ...all the statements of the collection (of named containers)

It's an RDF merge.  I don't see that the change of words changes that, so I have 
missed what you are getting at here.
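
For readers who want the RDF merge spelled out, here is a small Python sketch. 
It assumes triples are tuples of strings, that a term starting with "_:" is a 
bNode label, and that appending a per-container suffix is an acceptable way to 
rename bNodes apart; the rdf_merge name and the example data are hypothetical.

```python
def rdf_merge(containers):
    """Union all containers, renaming bNode labels apart per container."""
    merged = set()
    for index, name in enumerate(sorted(containers)):
        suffix = f"_g{index}"  # hypothetical relabelling scheme
        for triple in containers[name]:
            merged.add(tuple(
                term + suffix if term.startswith("_:") else term
                for term in triple
            ))
    return merged

containers = {
    "urn:example:g1": {("_:b0", "ex:name", "Alice"), ("ex:a", "ex:b", "ex:c")},
    "urn:example:g2": {("_:b0", "ex:name", "Alice"), ("ex:a", "ex:b", "ex:c")},
}
merged = rdf_merge(containers)
```

The ground triple appears once in the merge, while the two bNode triples stay 
distinct because the bNodes come from different containers.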

> 
> 
>>can be viewed as a
>>single RDF graph.  This graph need not be realised but the access to the
>>graph is RDF semantics - it's a set of statements formed by the RDF-merge
>>of the named graphs.
> 
> 
> ...of the named containers (graphs).
> 
> would make sense for this virtual graph (collection of named 
> containers) to have an identifier as well?

It is not *necessary* to have an identifier.  It's the data of the query context.

> 
> E.g.
> 
> in the syntax FROM <urn:dawg:foo.com:big-collection-of-named-graphs>  
> or FROM <rdfsource://foo.com/big/collection/of/named-graphs>
> 
> or in the protocol as you like GET 
> /model?from=urn:dawg:foo.com:big-collection-of-named-graphs

To require that would imply URIs are for collections of graphs with the 
collection structure accessible.  It precludes "FROM <u1> <u2>" as a merge of 
two graphs.

I left it open.  There is a collection in the context of the query execution. 
Details are left to the implementation.

I think you are really suggesting recursive containers.  That seems like a 
feature that is replacing the role of "graph" with collection of named 
containers. I would like to see an argument for its necessity (not for its 
usefulness in some situations - it needs to be the usefulness in a large number 
of situations).

Basically, it's a step too far.  It will take a long time to get the details 
sorted out as we have invented a new first class concept in the RDF universe.

> 
> 
>>2/ An individual named graph can be accessed as an individual RDF graph.
> 
> 
> how? explicitly by URI or triple-pattern (if bNode)?

The proposal says via the SOURCE keyword.  How URI gets to graph, it does not 
say.  That's intentional, to cover many implementations.

> 
> 
>>
>>== Notes
>>
>>Provenance information:
>>
>>This framework says nothing about system information such as timestamps
>>on graphs or other provenance information.  That gets into a whole
>>infrastructure for a provenance base layer which is beyond DAWG
>>timescale.  Systems will continue to innovate in this area.
> 
> 
> would such provenance information be accessible at the query level? - 
> i.e. can be used into triple-patterns ?
> 
> your SOURCE SYSTEM proposal below (if I understood it correctly) seems 
> along those lines...
> 
> in practice, for our test cases for example, where such provenance 
> information would be stored? in the manifest?
> 
> For some other SOURCE tests I am working on I am trying to put such 
> provenance information into the manifest itself, referring directly 
> or indirectly to the URI of the qt:data. If there is any other triple 
> having the qt:data URI as object and predicate dc:source (or 
> log:semantics), then I will express the named container (named graph) 
> as a bNode and referenced to it by description into the query. And each 
> triple from qt:data will get that bNode as SOURCE. Just an idea....
> 
> But again, we do not have best practices how to express/markup such 
> named-graphs and provenance information (TriX is just a proposal), and 
> it might be harder to take the general case. We have some extensions 
> (again implementation specific) to include more than one graph in the 
> same file/URI - but how are other people here doing it? at API / 
> protocol level?
> 
> Last time I thought with Kendall about Rdflib it seems they are doing 
> this programmatically with API / hacks - how others are dealing with 
> this?
> 
> CWM case seems clearer due to the use of formulae and log:semantics

I did the following with cwm from CVS:

file:X.n3 -----------------------
@prefix :       <X#> .
@prefix data:   <http://example.org/data/> .
@prefix log:    <http://www.w3.org/2000/10/swap/log#> .


{ <file:D.n3> log:semantics ?g .
   ?g log:includes { data:a data:b data:c }
} => { ?g a :graph } .
-----------------------
file:D.n3 -----------------------
@prefix : <http://example.org/data/> .

:a :b :c .
-----------------------

then used "cwm X.n3 --think", which gives (I put the prefixes back and stripped 
out other stuff like the rule):

     ( { data:a  data:b data:c  } ) a X:result .

?g is an N3 formula.  Formulae (quoted graphs) are outside RDF (for better or 
worse - I'm not making a judgement).


[Aside: I found that cwm no longer allows patterns in the RHS of log:includes. 
Is there no way of doing that now?]

> 
> 
>>It's not possible to access, say, two out of five containers as an
>>RDF-merge.  It's simple to extend to that but I am worried enough about
>>the implementation costs of dynamic (at runtime) RDF merging to not want
>>it mandated.
> 
> 
> well this is a design/implementation issue I guess - using quads this 
> can be achieved quite cheaply and not a full scan of each container is 
> needed on merge. Just "or-ing" one or more sources into your query and 
> the job is kind of easy.

It can be done but should we *require* it?  The proposal does not oblige 
implementations to provide that feature.  It's making collections of named 
containers first class objects.  Nice but do we have time?

NB. Not all systems use quads in that way (e.g. quads where the 4th slot is 
triple id.)  Not all systems are quads.
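
To make the "or-ing" of sources concrete, here is a hedged Python sketch.  It 
assumes a store whose fourth slot is the container name and whose bNode labels 
are already distinct across containers; as noted above, that is only one way 
of using quads.  The merge_of helper and the data are made up.

```python
# Quads as (subject, predicate, object, container name); all values are made up.
quads = [
    ("ex:a", "ex:p", "ex:x", "urn:example:g1"),
    ("ex:a", "ex:p", "ex:x", "urn:example:g2"),
    ("ex:a", "ex:p", "ex:y", "urn:example:g3"),
    ("ex:b", "ex:p", "ex:z", "urn:example:g4"),
    ("ex:c", "ex:p", "ex:z", "urn:example:g5"),
]

def merge_of(quads, names):
    """Merge of the chosen containers: filter on the 4th slot, drop duplicates."""
    wanted = set(names)
    return {(s, p, o) for (s, p, o, g) in quads if g in wanted}

two_of_five = merge_of(quads, ["urn:example:g1", "urn:example:g2"])
```

In a store where the fourth slot is a triple id, the filter above has nothing 
to select on, which is why the feature is not required of every implementation.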

> 
> 
>>RDF-Merge:
>>
>>bNodes are made distinct by the bNodes relabelling requirement. As the
>>bNodes labels are never revealed in a query, this is the same, for
>>query, as assuming bNodes are all distinct.  A bNode in a named
>>container is different from any bNode in a different named container if
>>it is not the same graph (that is, same graph, different names).  A
>>bNode in the aggregate graph is the same bNode as in the container
>>("same" means "query same" i.e. same value as concerns matching).
> 
> 
> ok makes sense
> 
> 
>>
>>==== No SOURCE - plain (?x ?y ?z)
>>
>>    WHERE ( ?x ?y ?z )
>>
>>Where there is no SOURCE applied to a pattern, the pattern is matched
>>against the aggregation graph - the RDF merge of all the named
>>containers.  When being SPARQL-compatible, it contains no more
>>statements than exactly the RDF-merge.  I'd expect many systems to
>>execute in a non-compatible mode that exposes the local provenance
>>information and other features.
> 
> 
> many system (quads ones) will need to use SELECT DISTINCT to get same 
> effect - IMO quads are useful because one wants/needs to distinguish 
> between different "copies" of "same" triples :) and with the RDF merge 
> this benefit would be kind of lost completely if implemented as 
> specified into the MT RDF doc.

Implied DISTINCT only masks the difference in cases where all the variables are 
asked for.

SELECT ?x ?y WHERE (?x ?y ?z) does not give the same number of results (assuming 
some plain store of triples with RDF semantics).
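
A small Python illustration of that counting argument.  It assumes two named 
containers with no bNodes, so the RDF merge is a plain set union; the container 
names and the project_xy helper are hypothetical.

```python
# Each container is a set of (s, p, o) triples; names and data are made up.
containers = {
    "urn:example:g1": {("a", "p", "x"), ("a", "p", "y")},
    "urn:example:g2": {("a", "p", "x")},
}

# Plain triple store with RDF semantics: the merge is a set of triples.
merged = set().union(*containers.values())

def project_xy(triples):
    """SELECT ?x ?y WHERE (?x ?y ?z), no DISTINCT: one row per matching triple."""
    return [(s, p) for (s, p, o) in triples]

plain_rows = project_xy(merged)
quad_rows = [row for g in containers.values() for row in project_xy(g)]
distinct_quad_rows = set(quad_rows)
```

Here the plain store yields 2 rows, the per-container evaluation yields 3, and 
adding DISTINCT to the latter yields 1, so neither quad variant reproduces the 
plain count once ?z is projected away.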

> 
> 
>>==== SOURCE
>>
>>The SOURCE operation allows access to a named container as an RDF graph.
>>
>>    WHERE SOURCE <uri1> ( ?x ?y ?z )
>>
>>is all the triples from the named container <uri1> and no more.
> 
> 
> ok URI SOURCE case
> 
> 
>>    WHERE SOURCE ?src ( ?x ?y ?z )
>>
>>In the procedural interpretation of a query, if ?src is bound then this
>>executes the query pattern in the named container.  If ?src is not bound
>>it means execute on each container individually, with ?src bound to the
>>URI of the container.
> 
> 
> how do you bind ?src then? in the protocol? or using some tricky FROM 
> (?src dawg:source <uri>) (or like other log:semantics ?)

It's a URI - rest is implementation.

> 
> 
>>What SOURCE does is restrict access to the
>>named container (not the overall RDF merge).
> 
> 
> sure that's the idea - filter out triples to a certain space...
> 
> 
>>If the triple pattern elements are RDF terms:
>>
>>    SOURCE ?src ( :x :y "z" )
>>
>>then this is asking for all named containers that have the statement
>>:x :y "z" - that is, testing to see where a statement can be found.
> 
> 
> is this query also meant to bind the ?src variable?
> 
> 
>>Incidentally,
>>
>>    SELECT DISTINCT ?x ?y ?z WHERE SOURCE ?src (?x ?y ?z)
>>
>>has the same results as
>>
>>    SELECT ?x ?y ?z WHERE (?x ?y ?z)
> 
> 
> this would require a default SELECT DISTINCT anyway for quads...
> 
> 
>>"Union query" can be achieved with:
>>
>>    SELECT ?x ?y ?z WHERE SOURCE ?src (?x ?y ?z)
>>
>>It is the concatenation of the results from querying each graph in
>>turn.
> 
> 
> .....is ?src var bound?

Yes.
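
The procedural reading of SOURCE can be sketched in Python as below.  This is 
an illustration under assumptions, not a definition: containers are a dict 
from URI to a set of string triples, terms starting with "?" are variables, 
and the match and source helpers are hypothetical names.

```python
def match(pattern, triple):
    """Match one triple pattern; return a binding dict, or None on failure."""
    binding = {}
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if binding.setdefault(term, value) != value:
                return None  # same variable, two different values
        elif term != value:
            return None
    return binding

def source(containers, src, pattern):
    """SOURCE src pattern: a URI picks one container, a variable ranges over all."""
    is_var = src.startswith("?")
    names = sorted(containers) if is_var else [n for n in [src] if n in containers]
    rows = []
    for name in names:
        for triple in sorted(containers[name]):
            binding = match(pattern, triple)
            if binding is None:
                continue
            if is_var and binding.setdefault(src, name) != name:
                continue  # ?src also occurs in the pattern with another value
            rows.append(binding)
    return rows

containers = {
    "urn:example:g1": {("ex:x", "ex:y", "z"), ("ex:a", "ex:b", "ex:c")},
    "urn:example:g2": {("ex:x", "ex:y", "z")},
}
union_rows = source(containers, "?src", ("?x", "?y", "?z"))
where_found = source(containers, "?src", ("ex:x", "ex:y", "z"))
only_g2 = source(containers, "urn:example:g2", ("?x", "?y", "?z"))
```

With an unbound ?src the result is the concatenation of the per-container 
results, each row carrying the container URI as the value of ?src.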

> 
> 
>>
>>==== FROM
>>
>>This is as much about "protocol" as query but its needed for the local
>>query case where there isn't a protocol layer.
> 
> 
> what is a local query? file:/// rdfstore:/// or even 
> data:application/rdf+xml;utf-8.....

Local query includes, but is not limited to, direct access to the graph, e.g. 
handing a query processor a programming language object.  No URI.  No protocol.

> 
> 
>>FROM establishes the data for a query.
> 
> 
> data - which data? the collection of named containers? or one of those 
> name containers? the RDF-Merge of them?

RDF merge.  There's a test case for this.

> 
> here I mean we might need a way to explicitly bind ?src (SOURCE) vars 
> and/or relate them to the FROM clause part (CWM gives some hints on 
> this as far as I understand it....)
> 
> 
>> How URIs of named containers get
>>handled is up to the implementation but some systems will load URLs and
>>files, some will attach to databases and some will do nothing much
>>because the system environment handles getting to some collection of
>>named containers.  There is no requirement to load URLs across the web.
> 
> 
> but again I am still confused, what is a "local query" ?
> 
> A local database might be as well be federated and distributed....
> 
> 
>>== Case 1: "FROM <u1> <u2>"
>>
>>Build a data context with two named containers named <u1> and <u2>.
> 
> 
> a collection of two named containers in your definition...
> 
> 
>>
>>== Case 2: "FROM <u1>"
>>
>>Build a data context with one named container.
> 
> 
> the following applies to Case 1 and Case 2 I guess...
> 
> 
>>Accessing the container
> 
> 
> container(s)
> 
> 
>>via SOURCE and accessing the aggregation sees the same RDF graph down to
>>the bNodes. If there is
> 
> 
> what do you mean "sees the same RDF graph" ?

Same bNodes - if you ask for the property/values of such a bNode you see the 
same thing in the overall merge as in the container having the bNode.

> 
> 
>> no SOURCE in the query, this is just querying
>>the graph identified by <u1> by however the system does it.
> 
> 
> I do not understand this?
> 
> 
>>== Case 3: No FROM in query.
>>
>>The implementation has to set the query data context.  This can be a
>>single graph
> 
> 
> named container...
> 
> 
>>or a collection of named containers.
>>
>>If there is no name information, SOURCE ?src ( ?x ?y ?z ) can be either:
>>
>>3a/ fail - ?src can't be bound
>>
>>3b/ match as if it's a single graph but ?src is not bound.
> 
> 
> +1
> 
> ?src = NULL / undef

-1 to NULLs : nowhere else do NULLs appear.  They act funny - can a later 
binding reset them?  What happens to earlier uses when they were NULL?  Reevaluated?

I propose a query is defined over either:

1/ a collection of named containers
2/ a graph

in particular, there is no requirement for having a collection of named and 
unnamed containers.

> 
> 
>>
>>Note: it's not possible to create a mix of named and unnamed containers
>>in the query data.
> 
> 
> what is an unnamed container?
> 
> 
>>That is intentional.  Implementations may choose to
>>allow this but there would be no test cases.  Same goes for ?src being a
>>bNode and having some vocabulary to describe the container or container
>>graph.
> 
> 
> ok good - you are trying to motivate the need for some implementation 
> specific stuff for this..
> 
> 
>>I'd expect the case of no FROM, and getting the query context from
>>outside to be common in the local case.
> 
> 
> well, this is the other thread I guess "do we need FROM or not ?"
> 
> 
>>
>>== Case 4: "FROM <u> <u>" (same URI)
>>
>>This highlights the case where two URIs name the same graph; in more
>>general cases this would have to be done outside the query language FROM
>>statement.
>>
>>For the same URI case, this can go one of three ways:
>>
>>4a/ Creates a data context with two named containers that do not share
>>bNodes.  It's like reading in the file twice.
> 
> 
> +1
> 
>>4b/ Creates a data context with two named containers that name the same
>>graph.  bNodes are the same.
> 
> 
> do you mean ignore one of the two?
> 
> 
>>4c/ Make it illegal.
>>
>>Because the same URI is used, it's possible to get indistinguishable
>>query results - that's an argument in favour of 4c.
>>
>>
>>==== Systems
>>
>>== cwm:
>>
>>In cwm "SOURCE <u> ( ?x ?y ?z )" is:
>>
>>    @forAll <#X>, <#Y>, <#Z>
>>    <u> log:semantics ?g .
>>    ?g log:includes { <#X>, <#Y>, <#Z> } .
>>
>>and "SOURCE ?u ( ?x ?y ?z )"
>>
>>    @forAll <#X>, <#Y>, <#Z>
>>    ?u log:semantics ?g . ?g log:includes { <#X>, <#Y>, <#Z> }
>>
>>It has been arranged that in named containers ?g can't be returned.
> 
> 
> why can not be returned? what's bad about it?
> 
> CWM seems having an elegant (to me at least) way to bind ?src variables 
> via log:semantics - why couldn't we use something similar (dawg:source) 
> ?
> 
> 
>>== 3Store/RDFStore use cases
>>
>>http://www.ecs.soton.ac.uk/~swh/source-tests/ has the form:
>>  SOURCE ?snode (?person <foaf:name> ?name),
>>       (?snode <dc:source> ?source)
>>
>>None of the examples return ?snode.
> 
> 
> that was just a choice in the specific test query - ?snode (if bNodes 
> allowed) can be returned as well in the SELECT vars part
> 
> 
>> It provides a resource for
>>annotations about graphs in the database.  This query extract is the
>>same as:
>>
>> SOURCE ?source (?person <foaf:name> ?name)
>>
>>in this named containers proposal, that is, named containers hides
>>?snode and hence there is no issue about returning it.  It wraps up the
>>use of "?snode" and "dc:source" into a single construct leaving open
>>different implementations.
> 
> 
> yes, but then you can not constrain ?snode
> 
> or is not clear to me how I could do that unless the SOURCE SYSTEM 
> construct below is meant for that...
> 
> 
>>
>>In Named Containers, there is no standardised way to annotate the
>>containers.
> 
> 
> annotate the container? you mean provenance information then?? :)
> 
> 
>> It is not excluded, it's just outside the SPARQL spec.  The
>>?snode can be retrieved by "(?snode <dc:source> ?source)" to access
>>system information - the constants of such are outside this proposal -
>>but a normal query processor meeting the spec does not need to introduce
>>a predicate like dc:source specially.
> 
> 
> so the same applies to log:semantics then?
> 
> 
>>I understand that 3Store has a system graph and RDFStore adds statements
>>to the graph as it is loaded or associated with the query - there are
>>different results for queries.
> 
> 
> After first quick tests it seems 3Store and RDFStore are interoperable 
> about the SOURCE issue as far as I can tell - the problem I see is to 
> decide how to bind those ?src vars and/or how to express them into the 
> data and/or manifest file

test case in your terminology:

SELECT * WHERE
	SOURCE ?snode ( ?snode dc:source  <uri> )

because triples go into different graphs.

> 
> 
>>  Both these detailed provenance solutions
>>are possible in this framework.
> 
> 
> hope so :)
> 
> 
>>== Kowari/TKS
>>
>>The "from" keyword in Kowari allows the creation of a target graph
>>through the union and intersection of sets of statements.  If bNodes are
>>kept distinct, union is RDF-merge because Kowari works on sets: the
>>union will do the duplicate suppression (could someone confirm this
>>please?)
>>
>>In addition, the "in" keyword allows a pattern to be applied to a named
>>graph.  It appears that the graph name can't be a variable.
> 
> 
> graph names are URIs for them then?

You should ask Tom or Simon - I understand that the answer is "yes", reading 
kowari.org.

> 
> 
>>==== Bells and Whistles
>>
>>We could have:
>>
>>    SOURCE * (?x ?y ?z)
>>
>>which is the pattern applied to each container in turn. * is just a
>>symbol for a variable not used elsewhere. SOURCE * .... SOURCE * does
>>not match the container URIs.
>>
>>   SOURCE SYSTEM (?x ?y ?z)
>>
>>Access the implementation defined environment, including all sorts of
>>things like time, operating system version etc etc.
> 
> 
> is this the trick we (me/SteveH/CWM people) could use to explicitly 
> bind ?src vars to the URI of the SOURCE?
> 
> 
>> Could also hold the
>>metadata about the named containers.  As it is implementation-specific,
>>it should be separate from the graph of all the containers (at least in
>>SPARQL-compatible mode).
> 
> 
> do you mean things like this?
> 
> SOURCE SYSTEM (?snode dawg:source ?source)
> 
> or for cwm
> 
> SOURCE SYSTEM (?snode log:semantics ?source)
> 
> 
> anyway, good proposal - let's hope now we can get some more detailed 
> discussion going on in the WG about this important issue
> 
> thanks again Andy
> 
> Yours
> 
> Alberto
> 
>
Received on Tuesday, 12 October 2004 16:21:56 UTC