RE: triple (quad) storage sizing

All

With Virtuoso, compressing row-wise, we get an average of 27 bytes per quad
in allocated database pages, excluding literals and IRI strings. The index
scheme is PSOG, POGS, OP, SP, GS. Note that the last three are not
covering indices but projections of distinct values of the columns
concerned, hence smaller. POGS is bitmap compressed on S.

With column-wise compression, we get between 9.8 bytes per quad (DBpedia)
and 6.4 bytes (BSBM or RDF-ized TPC-H). The logical index layout is the same
as in the row-wise model but the physical layout is column-wise. The column
store is quite operational but is not generally available as yet.
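
To put the per-quad figures side by side, here is a rough back-of-envelope sketch. It uses only the numbers quoted in this thread (96 bytes from the naive 3 x 32 estimate below, 27 bytes row-wise, 9.8 and 6.4 bytes column-wise). The one-billion-quad dataset size and the function names are assumptions made for illustration, and literals and IRI strings are left out, as they are in the figures themselves.

```python
# Back-of-envelope index sizing from the per-quad figures quoted in this thread.
# Literals and IRI strings are excluded, as in the quoted figures.

BYTES_PER_QUAD = {
    "naive 3 x 32-byte indexes": 96.0,
    "Virtuoso row-wise": 27.0,
    "Virtuoso column-wise (DBpedia)": 9.8,
    "Virtuoso column-wise (BSBM / TPC-H)": 6.4,
}


def index_size_gib(n_quads, bytes_per_quad):
    """Return the estimated index size in GiB for n_quads quads."""
    return n_quads * bytes_per_quad / 2**30


def sizing_table(n_quads):
    """Map each layout name to its estimated index size in GiB."""
    return {name: index_size_gib(n_quads, b) for name, b in BYTES_PER_QUAD.items()}


if __name__ == "__main__":
    n = 1_000_000_000  # hypothetical dataset of one billion quads
    for name, gib in sizing_table(n).items():
        print(f"{name:40s} {gib:8.1f} GiB")
```

The real figure for any given dataset will differ, since the column-wise numbers above already vary by a factor of about 1.5 between DBpedia and BSBM.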

Column-wise compression will also cut down on the literals and IRI strings
since a contiguously stored column of these strings will have repetition
that is amenable to low-cost stream compression, e.g. LZO, Snappy. We have
not done this yet. Contrary to common belief, the column store is quite OK
for inserts, in fact sometimes a notch faster than the row-wise equivalent.
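
The point about repetition in a contiguously stored string column can be illustrated with a small sketch. This is not what Virtuoso does: zlib at its fastest level stands in for LZO or Snappy only because it ships with Python, and the IRIs are synthetic placeholders under example.org. The function names are likewise made up for the example.

```python
import zlib


def make_iri_column(n):
    """Build a sorted column of synthetic IRIs that share a long common prefix."""
    # example.org is a placeholder namespace, not a real dataset.
    return [f"http://example.org/resource/item{i:07d}" for i in range(n)]


def compression_ratio(strings):
    """Return raw_size / compressed_size for the strings stored back to back."""
    raw = "\n".join(strings).encode("utf-8")
    packed = zlib.compress(raw, 1)  # level 1: the cheapest, fastest zlib setting
    return len(raw) / len(packed)


if __name__ == "__main__":
    ratio = compression_ratio(make_iri_column(100_000))
    print(f"compression ratio: {ratio:.1f}x")
```

Real IRI and literal columns are less regular than this synthetic one, so the ratio printed here says nothing about what a production store would achieve; it only shows why shared prefixes stored next to each other compress cheaply.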


These matters are further explained in the paper linked from my blog
http://virtuoso.openlinksw.com/blog.  The post in question is around Sep
2010, about the VLDB Semdata workshop.




Orri


-----Original Message-----
From: semantic-web-request@w3.org [mailto:semantic-web-request@w3.org] On
Behalf Of Paul Gearon
Sent: Wednesday, May 11, 2011 5:22 PM
To: Steve Harris
Cc: William Waites; Semantic-Web
Subject: Re: triple (quad) storage sizing

On Wed, May 11, 2011 at 5:59 AM, Steve Harris <steve.harris@garlik.com>
wrote:
> On 2011-05-10, at 14:45, William Waites wrote:
>> Typically there will be three different indexes for useful 
>> permutations of (s,p,o,g) -- (g,s,p,o), (p,s,o,g), (o,p,s,g) for 
>> example. Assuming three indexes, a safe estimate is 96 bytes (3x 32) 
>> per triple.
> ...
>
> This doesn't follow, e.g. bitmap indexes, and the index structure that
> 5store uses are a lot more compact than that.
>
> Don't discount the indexing of lexical values for nodes though, for some
> datasets that can be quite expensive, anything up to 3x the size of the
> quad index.

Quite true. For instance, data sets with lots of strings (particularly large
strings) can get expensive to store. This can be a much more important
influence than the number of triples (or quads).

In the case of Parliament the lexical indices form the basis of the quad
indexing. So almost all of the work and space is in those indexes. The quads
themselves are just stored flat, and have the potential to be packed in more
tightly with compression.

Speaking of which, some structures are amenable to compression, which is
useful in terms of either space or bandwidth, particularly when CPUs
typically operate at speeds that are orders of magnitude faster than the
other bottlenecks in the system. This should be considered as well.

Paul Gearon
Revelytix, Inc

Received on Wednesday, 11 May 2011 17:32:55 UTC