Re: triple (quad) storage sizing

On 10/05/11 14:45, William Waites wrote:
> I'm looking at requirements for making available some large datasets,
> and ran the back of the envelope calculation below. It is vendor and
> implementation agnostic for the most part and I'd like to know if this
> reasoning makes sense or if I'm missing something important.
>
> Suppose we want to store quads, (s,p,o,g). The size of the lexical
> values isn't terribly important, but indexes are. Ideally we want to
> keep indexes in RAM or if that isn't possible on a fast disk like an
> SSD or a 15kRPM SAS disk.
>
> Each term will get assigned a number, probably 64 bit. Assuming we
> have a magic data structure that doesn't need to use any pointers, one
> entry in one index will typically take up 32 bytes (not quite for
> e.g. a predicate-rooted index because the number of distinct
> predicates will typically be very small, also not quite true for a
> graph-rooted index if your dataset makes light use of graphs).
>
> Typically there will be three different indexes for useful
> permutations of (s,p,o,g) -- (g,s,p,o), (p,s,o,g), (o,p,s,g) for
> example. Assuming three indexes, a safe estimate is 96 bytes (3x 32)
> per triple.
>
> Multiplying this out, a 200 million triple dataset will want about
> 18Gb of RAM to be truly happy, and that's without any space for the
> working set for evaluating queries.
>
> This does seem quite a lot, especially when you start considering
> datasets with sizes an order of magnitude larger than that.
>
> Is this reasoning sound?

There are compression techniques, or data structures that don't store 
the whole of the quad where there is repetition. For example, for 
(g,s,p,o), some index data structures can store one (g,s) and all the 
(p,o).  This might well be done by a index data structure that stores 
common prefixes anyway rather than needing a special data structure for
RDF quads.

	Andy


>
> Cheers,
> -w

Received on Tuesday, 10 May 2011 16:40:02 UTC