Re: Compiling information from several different triplestores

Hi Nicolas,

On Tue, May 05, 2009 at 10:27:43PM +0900, Nicolas Raoul wrote:
> My dream is:
> 
> 1) I configure my "sparqldream" software to use dbpedia, freebase, and
> various big and frequently updated triplestores.
> 2) I run any SPARQL query on sparqldream.
> 3) sparqldream does whatever it needs to, and returns the result of my
> query, based on the most up-to-date information found in the
> configured triplestores, as if I had instantly copied all of them into
> a single local triplestore.
> 
> Does any such software exist?
> Or anything a bit similar?

I will describe what we have with http://www.cubicweb.org and let you
decide whether it is similar to your dream or not.

The CubicWeb framework is made of two parts, the data engine and the
web engine, which communicate via RQL[1].

The data engine wraps data sources that can be of different types,
including SQL, LDAP, RQL, subversion, mercurial. 

Links can traverse source boundaries. For example, a user stored in
LDAP can be linked to a document stored in subversion (this link is
stored in the primary SQL source, which is required). You could then do
'Any P,D WHERE P author_of D' with the data for P and D being stored
in different sources.
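To make the idea concrete, here is a toy sketch in Python of a relation whose endpoints live in different sources. This is purely illustrative and not CubicWeb's actual implementation; the dictionaries and the query function are invented for the example.

```python
# Toy model: users come from an LDAP-like source, documents from a
# subversion-like source, and the author_of relation itself is stored
# in the primary source, so it can cross the source boundary.

ldap_source = {"u1": {"login": "alice"}}          # users
svn_source = {"d1": {"path": "doc/spec.txt"}}     # documents
primary_source = [("u1", "author_of", "d1")]      # cross-source links

def query_author_of():
    """Resolve 'Any P,D WHERE P author_of D' across sources."""
    results = []
    for subj, rel, obj in primary_source:
        if rel == "author_of":
            # each endpoint is fetched from its own source
            results.append((ldap_source[subj], svn_source[obj]))
    return results

print(query_author_of())
```

The point is only that the relation table lives in one place while the entities it connects do not; the real engine of course does far more (planning, caching, security).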

A source can be another cubicweb data engine queriable via RQL. The
configuration of the source defines a "window" on the data. For
example http://www.logilab.org is our external forge. We also have an
internal forge on our intranet. This internal forge views the external
forge as a source of projects and versions. Other entities present in the
external forge do not appear in the internal forge.

Here is an excerpt of the internal forge config file named sources:
[external-forge]
adapter=pyrorql
pyro-ns-id=logilaborg
pyro-ns-host=dmzserver
mapping-file=mapping_internal_dmz.py
cubicweb-user=someuser
cubicweb-password=itspassword
base-url=http://www.logilab.org/

and the mapping_internal_dmz.py mapping file:
support_entities = {'Project': True, 'Version': True, 'State': True}
support_relations = {'in_state': True, 'version_of':True}
dont_cross_relations = set(('concerns', 'done_in'))

The data engine then takes care of the rest: discovering new objects,
removing references to old objects, caching, etc. The client querying
the data engine is not aware of the sources. It can send a query like
"Any P WHERE P is Project" and get the list of all projects. Thanks to
the base-url parameter above, each project will have its canonical
url, though.
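As a guess at what "canonical url" means here: one simple scheme, assumed for illustration and not taken from CubicWeb internals, is to build each mirrored entity's URL from the external source's base-url parameter rather than from the local instance's address.

```python
# Hypothetical sketch of how the base-url parameter could give entities
# fetched from the external source their canonical URL on that source.

BASE_URL = "http://www.logilab.org/"

def canonical_url(base_url, eid):
    # assumed scheme: the entity identifier appended to the external
    # source's base URL
    return base_url + str(eid)

print(canonical_url(BASE_URL, 1234))
```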

Adding other types of sources, including SPARQL of course, is on both
our todo-list and our vaporware-list at the moment.

I will stop here in order not to spam the list. Please ask for more
details if you are interested.

-
1: back when we started the cubicweb project in 2001, there was no
such thing as SPARQL. RQL is very similar to SPARQL and we are in the
process of implementing SPARQL in CubicWeb. Hopefully it will work
before summer.

-- 
Nicolas Chauvat

logilab.fr - services en informatique scientifique et gestion de connaissances  

Received on Friday, 8 May 2009 09:07:20 UTC