- From: William Waites <ww@styx.org>
- Date: Tue, 10 May 2011 15:45:20 +0200
- To: Semantic-Web <semantic-web@w3.org>
I'm looking at requirements for making available some large datasets, and ran the back-of-the-envelope calculation below. It is vendor- and implementation-agnostic for the most part, and I'd like to know if this reasoning makes sense or if I'm missing something important.

Suppose we want to store quads, (s,p,o,g). The size of the lexical values isn't terribly important, but the indexes are. Ideally we want to keep the indexes in RAM or, if that isn't possible, on a fast disk like an SSD or a 15kRPM SAS disk.

Each term will get assigned a number, probably 64 bits wide. Assuming we have a magic data structure that doesn't need any pointers, one entry in one index will typically take up 32 bytes (4 terms x 8 bytes). This isn't quite true for e.g. a predicate-rooted index, because the number of distinct predicates will typically be very small, nor for a graph-rooted index if your dataset makes light use of graphs.

Typically there will be three different indexes covering the useful permutations of (s,p,o,g) -- (g,s,p,o), (p,s,o,g) and (o,p,s,g), for example. Assuming three indexes, a safe estimate is 96 bytes (3 x 32) per quad.

Multiplying this out, a 200 million triple dataset will want about 18 GB of RAM to be truly happy, and that's without any space for the working set needed to evaluate queries. This does seem like quite a lot, especially when you start considering datasets an order of magnitude larger than that.

Is this reasoning sound?

Cheers,
-w

--
William Waites <mailto:ww@styx.org>
http://river.styx.org/ww/ <sip:ww@styx.org>
F4B3 39BF E775 CF42 0BAB 3DF0 BE40 A6DF B06F FD45
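[For concreteness, a small Python sketch of the same estimate. The constants are just the assumptions stated above (64-bit term IDs, four terms per entry, three full indexes, no pointer overhead), not measurements from any particular store.]

    # Back-of-the-envelope index size estimate, under the assumptions above.
    TERM_ID_BYTES = 8        # each term mapped to a 64-bit number
    TERMS_PER_QUAD = 4       # (s, p, o, g)
    NUM_INDEXES = 3          # e.g. (g,s,p,o), (p,s,o,g), (o,p,s,g)

    bytes_per_index_entry = TERM_ID_BYTES * TERMS_PER_QUAD      # 32 bytes
    bytes_per_quad = bytes_per_index_entry * NUM_INDEXES        # 96 bytes

    n_quads = 200_000_000                                       # example dataset size
    total_bytes = n_quads * bytes_per_quad

    print(total_bytes / 1e9)    # ~19.2 GB (decimal)
    print(total_bytes / 2**30)  # ~17.9 GiB, i.e. "about 18 GB" of RAM for indexes alone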
Received on Tuesday, 10 May 2011 13:45:45 UTC