- From: Orri Erling <erling@xs4all.nl>
- Date: Wed, 11 May 2011 19:31:16 +0200
- To: "'Paul Gearon'" <gearon@ieee.org>, "'Steve Harris'" <steve.harris@garlik.com>
- Cc: "'William Waites'" <ww@styx.org>, "'Semantic-Web'" <semantic-web@w3.org>
All,

With Virtuoso, compressing row-wise, we get an average of 27 bytes per quad in allocated database pages, excluding literals and IRI strings. The index scheme is PSOG, POGS, OP, SP, GS. Note that the last three are not covering indices but projections of distinct values of the columns concerned, hence smaller. POGS is bitmap compressed on S.

With column-wise compression, we get between 9.8 bytes per quad (DBpedia) and 6.4 bytes per quad (BSBM or RDF-ized TPC-H). The logical index layout is the same as in the row-wise model, but the physical layout is column-wise. The column store is quite operational but is not generally available as yet.

Column-wise compression will also cut down on the literals and IRI strings, since a contiguously stored column of these strings will have repetition that is amenable to low-cost stream compression, e.g. LZO or Snappy. We have not done this yet.

Contrary to common belief, the column store is quite OK for inserts, in fact sometimes a notch faster than the row-wise equivalent.

These matters are further explained in the paper linked from my blog http://virtuoso.openlinksw.com/blog. The post in question is from around Sep 2010, about the VLDB Semdata workshop.

Orri

-----Original Message-----
From: semantic-web-request@w3.org [mailto:semantic-web-request@w3.org] On Behalf Of Paul Gearon
Sent: Wednesday, May 11, 2011 5:22 PM
To: Steve Harris
Cc: William Waites; Semantic-Web
Subject: Re: triple (quad) storage sizing

On Wed, May 11, 2011 at 5:59 AM, Steve Harris <steve.harris@garlik.com> wrote:
> On 2011-05-10, at 14:45, William Waites wrote:
>> Typically there will be three different indexes for useful
>> permutations of (s,p,o,g) -- (g,s,p,o), (p,s,o,g), (o,p,s,g) for
>> example. Assuming three indexes, a safe estimate is 96 bytes (3x 32)
>> per triple.
> ...
>
> This doesn't follow, e.g. bitmap indexes, and the index structure that 5store uses are a lot more compact than that.
>
> Don't discount the indexing of lexical values for nodes though, for some datasets that can be quite expensive, anything up to 3x the size of the quad index.

Quite true. For instance, data sets with lots of strings (particularly large strings) can get expensive to store. This can be a much more important influence than the number of triples (or quads).

In the case of Parliament the lexical indices form the basis of the quad indexing. So almost all of the work and space is in those indexes. The quads themselves are just stored flat, and have the potential to be packed in more tightly with compression.

Speaking of which, some structures are amenable to compression, which is useful in terms of either space or bandwidth, particularly when CPUs typically operate at speeds that are orders of magnitude faster than the other bottlenecks in the system. This should be considered as well.

Paul Gearon
Revelytix, Inc
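
To put the per-quad figures quoted in this thread side by side, here is a minimal Python sketch that turns them into index sizes for a store of a given size. The bytes-per-quad numbers are the ones stated above. The function names, the dictionary keys, and the one-billion-quad example are assumptions made for illustration only. Literals and IRI strings are left out, as they are in the quoted figures, and the 96-byte estimate was originally given per triple with three full indexes.

```python
# Bytes per quad as quoted in the thread. Index pages only: literals and
# IRI strings are excluded, as in the original figures.
# The key names are hypothetical labels, not product terminology.
BYTES_PER_QUAD = {
    "three_full_indexes_estimate": 96.0,   # 3 x 32 bytes, the "safe estimate"
    "virtuoso_row_wise": 27.0,
    "virtuoso_column_wise_dbpedia": 9.8,
    "virtuoso_column_wise_bsbm": 6.4,
}


def index_size_gib(quads, bytes_per_quad):
    """Return the index size in GiB for `quads` quads at `bytes_per_quad`."""
    return quads * bytes_per_quad / 2**30


def sizing_table(quads):
    """Map each quoted figure to an index size in GiB, rounded to 0.1 GiB."""
    return {
        name: round(index_size_gib(quads, bpq), 1)
        for name, bpq in BYTES_PER_QUAD.items()
    }


if __name__ == "__main__":
    # One billion quads is an arbitrary example size, not a figure from the thread.
    for name, gib in sizing_table(1_000_000_000).items():
        print(f"{name:32s} {gib:8.1f} GiB")
```

As noted in the replies, the lexical (string) indexes can cost up to 3x the quad index for some datasets, so an estimate like this covers only part of the total footprint.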
Received on Wednesday, 11 May 2011 17:32:55 UTC