Re: triple (quad) storage sizing from William Waites on 2011-05-10 (semantic-web@w3.org from May 2011)

From: William Waites <ww@styx.org>
Date: Tue, 10 May 2011 20:26:35 +0200
To: Andy Seaborne <andy.seaborne@epimorphics.com>
Cc: semantic-web@w3.org
Message-ID: <20110510182635.GM31006@styx.org>

* [2011-05-10 17:39:30 +0100] Andy Seaborne <andy.seaborne@epimorphics.com> wrote:
]
] There are compression techniques, or data structures that don't store 
] the whole of the quad where there is repetition. For example, for 
] (g,s,p,o), some index data structures can store one (g,s) and all the 
] (p,o).  This might well be done by an index data structure that stores 
] common prefixes anyway rather than needing a special data structure for
] RDF quads.

So if I understand correctly, in the best case, where each subject
occurs in exactly one graph, this example gives us basically the same
properties as a triplestore, so a saving of about 25% (three terms per
statement instead of four). There are probably diminishing returns
when one tries to do the same with, e.g., (s,p) and (o), unless it is
very common for the same predicate to repeat many times on the same
subject (which is not the case with most real datasets).
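
To make the counting concrete, here is a rough sketch. It assumes every
term is a fixed-size id and ignores any per-node overhead in the index,
so it only counts term slots. The function names and the toy data are
made up for illustration and do not describe any particular store.

```python
from collections import Counter

def flat_slots(quads):
    """Term slots used when every quad is stored in full."""
    return 4 * len(quads)

def shared_prefix_slots(quads, prefix_len):
    """Term slots used when each distinct leading prefix is stored once
    and only the remaining terms are stored per statement."""
    groups = Counter(q[:prefix_len] for q in quads)
    suffix_len = 4 - prefix_len
    return sum(prefix_len + suffix_len * n for n in groups.values())

def saving(quads, prefix_len):
    """Fraction of term slots saved relative to the flat layout."""
    return 1 - shared_prefix_slots(quads, prefix_len) / flat_slots(quads)

# Hypothetical toy data: one graph, two subjects, four distinct
# predicates per subject (so no predicate repeats on a subject).
quads = [("g1", f"s{i}", f"p{j}", f"o{i}{j}")
         for i in range(2) for j in range(4)]

for k, name in [(1, "(g)"), (2, "(g,s)"), (3, "(g,s,p)")]:
    print(name, shared_prefix_slots(quads, k), f"{saving(quads, k):.1%}")
```

On this toy data, sharing (g) alone saves about 22%, sharing (g,s)
saves 37.5%, and sharing (g,s,p) saves nothing because no predicate
repeats on a subject. The real figures depend entirely on how many
statements hang off each prefix in the dataset.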

So then there are two questions.

  Theory: can we do better than that?
  Practice: which triple/quad stores do this and what are the rules
    of thumb for bytes/statement to factor in when speccing hardware?

-w
-- 
William Waites                <mailto:ww@styx.org>
http://river.styx.org/ww/        <sip:ww@styx.org>
F4B3 39BF E775 CF42 0BAB  3DF0 BE40 A6DF B06F FD45

Received on Tuesday, 10 May 2011 18:26:59 UTC