trillions of triples?

Hi

What are the current limits on the number of triples on a single compute node?
What war stories are there about combining multiple compute nodes to make larger triple stores?

====

I have taken some funny stuff, and I am hallucinating very large triple stores ….

Approximate design:

In read-only mode:

- data T naturally shards into:
    - 1 billion background triples B
    - 1000 shards of 1 billion triples each: S_1, S_2, …

T = B union S_1 union S_2 union ….

- the possible queries are shardable in the following sense:
  each query q decomposes into m and r  [m does not stand for map, r does not stand for reduce]
  (q and m are both written in SPARQL)
  where m can be asked of any ( S_j union B ), and
  q( T ) = r( m( B union S_1 ), m( B union S_2 ), m( B union S_3 ), … )

  in practice q and m are close to identical, and r is either a SPARQL result-set union or something that fiddles with aggregates, such as adding up sums and counts.

i.e. we do not need/allow joins across the sharded data, but we may need to do aggregate queries and/or queries that select data out of each shard and combine the results …
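To make that concrete, here is a minimal Python sketch of the q -> (m, r) decomposition, assuming each ( S_j union B ) sits behind its own SPARQL endpoint. The endpoint URLs and the count query standing in for m are made-up placeholders, and r here just sums counts; for plain SELECT queries it would be a result-set union instead.

# Minimal sketch of the q -> (m, r) decomposition, assuming each
# ( S_j union B ) is exposed as its own SPARQL endpoint.  Endpoint
# URLs and the example query are hypothetical placeholders.
from SPARQLWrapper import SPARQLWrapper, JSON

# One endpoint per shard, each already holding S_j union B.
SHARD_ENDPOINTS = [
    f"http://shard{j}.example.org/sparql" for j in range(1, 1001)
]

# m: the per-shard query -- here a simple count; in practice m is
# close to the original query q, just asked of one shard.
M_QUERY = """
SELECT (COUNT(*) AS ?n)
WHERE { ?s ?p ?o }
"""

def run_m(endpoint: str) -> int:
    """Ask m of one ( S_j union B ) and return its partial answer."""
    sw = SPARQLWrapper(endpoint)
    sw.setQuery(M_QUERY)
    sw.setReturnFormat(JSON)
    result = sw.query().convert()
    return int(result["results"]["bindings"][0]["n"]["value"])

def r(partials):
    """r: combine the per-shard answers -- here adding up counts;
    for SELECT queries it would be a result-set union instead."""
    return sum(partials)

def q_over_T() -> int:
    """q( T ) = r( m( B union S_1 ), m( B union S_2 ), ... )."""
    return r(run_m(ep) for ep in SHARD_ENDPOINTS)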

====

My back-of-the-envelope calculations suggest that something like 10B triples per compute node, and say 5K compute nodes, is within the state of the art for a Hadoop-like structure, which comes to 50 trillion triples … with probably a factor of 2 or 4 on each of the two dimensions available with a bit of a squeeze, giving a maximum practical size of close to a quadrillion triples …
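Spelled out, the arithmetic is just the following (the 4x squeeze on each dimension is the optimistic end of my guess, not a measurement):

# Back-of-the-envelope arithmetic from the figures above.
triples_per_node = 10e9          # ~10B triples per compute node
nodes = 5_000                    # ~5K nodes in a Hadoop-like cluster
baseline = triples_per_node * nodes
print(f"{baseline:.0e}")         # 5e+13, i.e. 50 trillion triples

# With a factor of ~4 squeezed out of each dimension:
stretched = (4 * triples_per_node) * (4 * nodes)
print(f"{stretched:.0e}")        # 8e+14, approaching a quadrillion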

Any thoughts?

Jeremy
