- From: William Waites <ww@styx.org>
- Date: Tue, 10 May 2011 15:45:20 +0200
- To: Semantic-Web <semantic-web@w3.org>
I'm looking at requirements for making available some large datasets, and ran the back-of-the-envelope calculation below. It is vendor- and implementation-agnostic for the most part, and I'd like to know if this reasoning makes sense or if I'm missing something important.

Suppose we want to store quads, (s,p,o,g). The size of the lexical values isn't terribly important, but the indexes are. Ideally we want to keep the indexes in RAM or, if that isn't possible, on a fast disk like an SSD or a 15kRPM SAS disk.

Each term will get assigned a number, probably 64 bits wide. Assuming we have a magic data structure that doesn't need any pointers, one entry in one index will typically take up 32 bytes (4 terms x 8 bytes). This isn't quite true for e.g. a predicate-rooted index, because the number of distinct predicates will typically be very small, nor for a graph-rooted index if your dataset makes light use of graphs.

Typically there will be three different indexes covering the useful permutations of (s,p,o,g) -- (g,s,p,o), (p,s,o,g) and (o,p,s,g), for example. Assuming three indexes, a safe estimate is 96 bytes (3 x 32) per quad.

Multiplying this out, a 200 million triple dataset will want about 18 GB of RAM to be truly happy, and that's without any space for the working set needed to evaluate queries. This does seem like quite a lot, especially when you start considering datasets an order of magnitude larger than that.

Is this reasoning sound?

Cheers,
-w

--
William Waites <mailto:ww@styx.org>
http://river.styx.org/ww/ <sip:ww@styx.org>
F4B3 39BF E775 CF42 0BAB 3DF0 BE40 A6DF B06F FD45
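[For concreteness, a small Python sketch of the same estimate. The constants are just the assumptions stated above (64-bit term IDs, four terms per entry, three full indexes, no pointer overhead), not measurements from any particular store.]

    # Back-of-the-envelope index size estimate, under the assumptions above.
    TERM_ID_BYTES = 8        # each term mapped to a 64-bit number
    TERMS_PER_QUAD = 4       # (s, p, o, g)
    NUM_INDEXES = 3          # e.g. (g,s,p,o), (p,s,o,g), (o,p,s,g)

    bytes_per_index_entry = TERM_ID_BYTES * TERMS_PER_QUAD      # 32 bytes
    bytes_per_quad = bytes_per_index_entry * NUM_INDEXES        # 96 bytes

    n_quads = 200_000_000                                       # example dataset size
    total_bytes = n_quads * bytes_per_quad

    print(total_bytes / 1e9)    # ~19.2 GB (decimal)
    print(total_bytes / 2**30)  # ~17.9 GiB, i.e. "about 18 GB" of RAM for indexes alone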
Received on Tuesday, 10 May 2011 13:45:45 UTC