Re: trillions of triples?

I would say that it's possible to get significantly more than 10B triples per node. The more nodes you have, the shorter the Mean Time Between Failures for the cluster as a whole, so it makes sense to minimise the number of nodes.

I would think 1TT (a trillion triples) on 16 or 32 nodes is a reasonable aim, specifically if you don't need replication - that works out to roughly 30-60B triples per node. It wouldn't be practical for a real-world deployment, but it's OK for a demo.

The strategy you take varies dramatically depending on whether or not you need to support live writes. If the store is essentially read-only, you can do impressive things with data compression.
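
To give a flavour (a toy Python sketch of the classic trick, not any particular store's format): dictionary-encode the terms once and keep the triples as sorted integer ID tuples, which compress far better than repeated strings:

    # Illustrative only: map each RDF term to an integer ID, store triples as
    # sorted fixed-width ID tuples (which delta-compress well).
    from typing import Dict, List, Tuple

    def dictionary_encode(triples: List[Tuple[str, str, str]]):
        term_ids: Dict[str, int] = {}

        def tid(term: str) -> int:
            # Assign the next free ID the first time a term is seen.
            return term_ids.setdefault(term, len(term_ids))

        encoded = sorted((tid(s), tid(p), tid(o)) for s, p, o in triples)
        return term_ids, encoded

    terms, spo = dictionary_encode([
        ("ex:alice", "foaf:knows", "ex:bob"),
        ("ex:alice", "foaf:name", '"Alice"'),
    ])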

The Virtuoso guys and Peter Boncz have done some great work applying column-store techniques to RDF storage. There are papers out there.

With careful use of high-bandwidth solid-state storage you don't have to hold the complete index in RAM, which lets you push up the number of quads you can store on a single box. It gets pricey though.

The 4store paper talks a lot about sharding; as far as I know that's still the best general-purpose strategy, though for specific scenarios there are obviously better solutions. http://4store.org/publications/harris-ssws09.pdf
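
The basic idea of that kind of resource-based partitioning is easy to sketch - hash one component of the quad (say the subject) to pick a segment, so everything about a given resource lands in one place. A toy illustration in Python (not 4store's actual code):

    # Toy subject-hash partitioning, illustrative only.
    import hashlib

    def segment_for(subject: str, num_segments: int) -> int:
        # Hash the subject so all quads about one resource map to one segment,
        # keeping subject-centred joins local to a single node.
        digest = hashlib.md5(subject.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % num_segments

    segment_for("http://example.org/alice", 32)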

A lot of well-known sharding techniques rely on being able to migrate data from node to node, but IMHO that's impractical if you have live writes coming into the system.

- Steve

On 2013-03-13, at 18:23, Jeremy J Carroll <jjc@syapse.com> wrote:

> 
> Hi
> 
> What are the current limits for numbers of triples on a single compute node?
> What war stories are there for combining multiple compute nodes to make larger triple stores?
> 
> ====
> 
> I have taken some funny stuff, and I am hallucinating very large triple stores ….
> 
> Approximate design:
> 
> In read-only mode:
> 
> - data T naturally shards into:
>    - 1 billion background triples B
>    - 1000 shards of 1 billion triples each S_1, S_2, …..
> 
> T = B union S_1 union S_2 union ….
> 
> - the possible queries are shardable in the following sense
>  each query q decomposes into m and r  [m does not stand for map, r does not stand for reduce]
>  (q and m are both written in SPARQL)
>  where m can be asked of any ( S_j union B ), and
>  q(T) = r( m(B union S_1), m(B union S_2), m(B union S_3), … )
> 
>  in practice q and m are close to identical, and r is either a SPARQL result-set union or some fiddling with aggregates, like adding up sums and counts.
> 
> i.e. we do not need/allow joins across the sharded data, but we may need to do aggregate queries and/or queries that select data out of each shard and combine …
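> 
> A rough sketch (Python + SPARQLWrapper; the endpoint URLs and the counting query are just placeholders) of what m and r could look like:
> 
>     # m: an ordinary SPARQL query sent to each (B union S_j) endpoint.
>     # r: re-aggregates the partial results returned by the shards.
>     from SPARQLWrapper import SPARQLWrapper, JSON
> 
>     SHARD_ENDPOINTS = ["http://shard-%03d.example.org/sparql" % i for i in range(1000)]
> 
>     M_QUERY = """
>     SELECT ?type (COUNT(?s) AS ?n)
>     WHERE { ?s a ?type }
>     GROUP BY ?type
>     """
> 
>     def m(endpoint):
>         sparql = SPARQLWrapper(endpoint)
>         sparql.setQuery(M_QUERY)
>         sparql.setReturnFormat(JSON)
>         return sparql.query().convert()["results"]["bindings"]
> 
>     def r(per_shard_bindings):
>         # Combine: add up the per-shard counts for each ?type.
>         totals = {}
>         for bindings in per_shard_bindings:
>             for row in bindings:
>                 key = row["type"]["value"]
>                 totals[key] = totals.get(key, 0) + int(row["n"]["value"])
>         return totals
> 
>     # q(T) = r( m(B union S_1), m(B union S_2), ... )
>     # answer = r(m(e) for e in SHARD_ENDPOINTS)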
> 
> ====
> 
> My back-of-the-envelope calculations suggest something like 10B triples per compute node, and say 5K compute nodes as being within the state of the art for a Hadoop-like architecture, which comes to 50 trillion triples … with probably a factor of 2 or 4 available on each of those two dimensions with a bit of a squeeze, giving a maximum practical size of close to a quadrillion triples …
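> 
> Spelled out in Python (just the multiplication above):
> 
>     per_node  = 10e9                          # ~10B triples per compute node
>     nodes     = 5000                          # ~5K compute nodes
>     base      = per_node * nodes              # 5e13  -> 50 trillion triples
>     stretched = (4 * per_node) * (4 * nodes)  # 8e14  -> approaching a quadrillion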
> 
> any thoughts?
> 
> Jeremy
> 
> 
> 
> 
> 
> 
> 

-- 
Steve Harris
Experian
+44 20 3042 4132
Registered in England and Wales 653331 VAT # 887 1335 93
80 Victoria Street, London, SW1E 5JL

Received on Thursday, 14 March 2013 10:43:36 UTC