Re: resources for network-based/hierarchical RDF store from Bradley Allen on 2007-04-30 (semantic-web@w3.org from April 2007)

From: Bradley Allen <ballen@siderean.com>
Date: Mon, 30 Apr 2007 12:34:38 -0700
To: Danny Ayers <danny.ayers@gmail.com>
CC: <semantic-web@w3.org>
Message-ID: <C25B92DE.77F0%ballen@siderean.com>

I thought I'd chime in on this discussion in the light of our announcement
today about breaking the billion-quad barrier in a pilot with Elsevier
(http://www.siderean.com/newsitem.aspx?pid=24) and add some additional gloss
to the information in that press release.

The benchmark we did with Elsevier was performed on a
hierarchically-clustered grid of 32 commodity Linux boxes, each running an
instance of Seamark Navigator. The RDF represented the bibliographical
information describing 40 million articles plus 10 million descriptions of
authors. The application was an end-user relational navigation interface
over the collection of articles and authors.

The principal difference between this approach and those in some of the
other large RDF stores discussed in this thread is the design emphasis on
sub-second query response under load for a relational navigation
application. In this type of application, a query is effectively equivalent
to several tens of SPARQL queries together with aggregate operators that are
returning facet value counts for selected attributes of matching resources,
along with additional queries to retrieve human-readable labels. That
said, the cluster also accepts SPARQL queries against the RDF graph in the
manner of most other stores.
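
To make the shape of such a navigation query concrete, here is a minimal
Python sketch over an in-memory list of quads. It is not how Seamark
Navigator is implemented; the sample data, the ex: prefixes, and the
navigate() function are all hypothetical, chosen only to show the three
steps described above: select the matching resources, count facet values
for selected attributes, and look up labels for display.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (subject, predicate, object, graph) quads.
QUADS = [
    ("ex:a1", "rdf:type", "ex:Article", "ex:g"),
    ("ex:a1", "ex:year", "2006", "ex:g"),
    ("ex:a1", "ex:journal", "ex:J1", "ex:g"),
    ("ex:a2", "rdf:type", "ex:Article", "ex:g"),
    ("ex:a2", "ex:year", "2007", "ex:g"),
    ("ex:a2", "ex:journal", "ex:J1", "ex:g"),
    ("ex:a3", "rdf:type", "ex:Article", "ex:g"),
    ("ex:a3", "ex:year", "2007", "ex:g"),
    ("ex:a3", "ex:journal", "ex:J2", "ex:g"),
    ("ex:J1", "rdfs:label", "Journal One", "ex:g"),
    ("ex:J2", "rdfs:label", "Journal Two", "ex:g"),
]


def navigate(quads, rdf_type, filters, facets):
    """Return (matching subjects, {facet: {display value: count}})."""
    # Index the quads as subject -> predicate -> set of objects.
    props = defaultdict(lambda: defaultdict(set))
    for s, p, o, _g in quads:
        props[s][p].add(o)

    # Step 1: resources of the requested type that satisfy every filter.
    matches = sorted(
        s for s, po in props.items()
        if rdf_type in po.get("rdf:type", ())
        and all(v in po.get(p, ()) for p, v in filters.items())
    )

    def label(value):
        # Step 3: use an rdfs:label when the value is a described resource,
        # otherwise fall back to the raw value (for example a literal).
        labels = props[value].get("rdfs:label") if value in props else None
        return min(labels) if labels else value

    # Step 2: one aggregate (value count) per requested facet.
    counts = {}
    for facet in facets:
        tally = Counter()
        for s in matches:
            for v in props[s].get(facet, ()):
                tally[label(v)] += 1
        counts[facet] = dict(tally)
    return matches, counts


matches, counts = navigate(
    QUADS, "ex:Article", {"ex:year": "2007"}, ["ex:journal", "ex:year"]
)
print(matches)
print(counts)
```

Each entry in counts stands in for one of the aggregate queries mentioned
above; a real interface would issue many of them per page view, which is
where the "several tens of SPARQL queries" figure comes from.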

The RDF quads are automatically partitioned across the cluster on the basis
of rdf:type of a given resource and its related resources as necessary to
answer relational navigation queries without doing joins across cluster
nodes. Updates to the store are handled concurrently and incrementally.

The architecture today has the ability to be scaled to provide storage on
the order of 10 gigaquads simply by adding more nodes to the cluster.
Further improvements in our development pipeline will add another 10x on
top of that.

A secondary difference, in contrast to the Garlik store, is that this is a
commercially-supported software product as opposed to a hosted Web service,
although we do provide hosting for applications like the Oracle
Technology Network Semantic Web (http://otnsemanticweb.oracle.com).

So, yes, Danny: not only is it doable, it's shipping. ;-) - regards, BPA

-- 
Bradley P. Allen
Founder and CTO
Siderean Software, Inc.
work: +1 310 647 5610
cell: +1 310 951 4300
skype: bpa777
YIM: bpallen777

On 4/27/07 3:58 AM, "Danny Ayers" <danny.ayers@gmail.com> wrote:

> 
> On 26/04/07, Andreas Langegger <andreas.langegger@gmx.at> wrote:
> 
>> We are working on a distributed query processor for SPARQL. Any pointers
>> are appreciated.
> 
> You probably have this already:
> 
> DARQ - Federated Queries with SPARQL
> http://darq.sourceforge.net/
> 
> A while ago Steve Harris suggested there might be a chance the *big*
> store they've developed for garlik.com could be open sourced.
> Listening to this podcast -
> 
> http://talk.talis.com/archives/2007/04/tom_ilube_talks.html
> 
> - apparently it can run on a cluster of generic Linux boxes. So at
> least we know it's doable ;-)
> 
> Cheers,
> Danny.

Received on Monday, 30 April 2007 19:34:45 UTC