W3C home > Mailing lists > Public > www-archive@w3.org > December 2001

Re: RDF Storage Systems

From: David McCusker <david@treedragon.com>
Date: Mon, 3 Dec 2001 05:19:17 -0500 (EST)
Message-ID: <3C0B4F53.3AFE58BA@treedragon.com>
To: Aaron Swartz <me@aaronsw.com>
Cc: www-archive@w3.org, mathew@oierw.net
Aaron,

> We're building an RDF API and a decentralized search system 
> on top of it for something called the Plex. Now we need a 
> storage system for it, and I know you have experience in this area.
> 
> Some thoughts so far:
> 
> IronDoc. Unfortunately it is still vaporware.
> 
> BerkeleyDB. Doesn't scale as well as Mathew (my partner, cced) wants
> -- he imagines 60GB stores. Also, I don't know the most efficient
> way to store RDF triples in key-value dictionaries.

I wrote a lengthy reply you can link at my weblog here:

http://www.treedragon.com/ged/map/ti/newDec01.htm#02dec01-scaling

scaling irondoc @ 
Aaron Swartz is interested in an RDF store that scales big. 
Of course IronDoc is vaporware and will stay so for months. 
I no longer try to give folks an impression it'll be soup soon. 
But it doesn't hurt to answer hypothetical usage questions. 
Thinking hypothetically is one way to analyze option context. 

In the following, please remember IronDoc is not available. 
I'd planned only to support up to four gigabyte files at first. 
Aaron is considering size figures more on the order of 60GB. 
This is rather big as content goes these days. A big disk size. 
I think I have useful advice to give about planning for failure. 

Everything fails. Partly cuz my code sucks and so does yours. 
But even if our code was quite fabulous, everything still fails. 
This is because entropy is inevitable, just like death and taxes. 
Systems only get more unstable. New ones replace old ones. 
So systems are more robust when they plan for partial failure. 

A system designed for partial failure tolerates some entropy. 
This prevents knife-edge total failure when a small part goes. 
Recovering from partial failure means rebuilding broken parts. 
This only works if you have some still good redundant copies. 
A 60GB database sounds a bit like all eggs in a basket to me. 

Or at least an engineer could make that mistake using 60 gigs. 
At least part of it will certainly fail sometime in normal use. 
So you'd want a design that didn't stall the whole thing then. 
Industrial databases have a measure of redundancy already. 
That's a good reason to use one. But value costs something. 

Okay, suppose you hate to use some high end db like Oracle. 
Or you lack the funds, etc. What do you do then? Is it free? 
You can be safe with a free database by using redundancy. 
I'd recommend a lot of write-only logging for later recovery. 
This is less chancy if you can accept tiny losses after damage. 

If you don't keep three copies, then I'd be worried about it. 
Keeping a lot of copies is more app practice than a db feature. 
So the remaining problem is how to scale huge with free code. 
The full 60GB db might be separated into numerous pieces. 
Decentralization is always a good idea, even in the cental db. 

Your biggest problem might be with single very large indexes. 
Typically the best scaling dictionary format is a type of btree. 
For a 60GB db, I could see one index going over 4GB in size. 
Or at least, you'd certainly worry about it. It could happen. 
If you used a 64-bit free database, it would double overhead. 

32 bits is more space efficient if you can partition the space. 
It's likely possible to put one btree into many separate files. 
I just haven't thought about it before. It seems very doable. 
For example, a btree root could use special children pointers. 
They would put the child subtree refs in separate file spaces. 

Until a file got too big, all children would go into one file. 
At some point, the root node uses another file when it splits. 
Okay, here's more detail. Consider one btree inside one file. 
For a tree of height N, a full tree might fit into a single file. 
If the tree gets higher, the node format must use bigger refs. 

Or maybe the namespace for a child subtree can be just a byte. 
This allows clumping namespace bytes at one end of a node. 
So keeping between-file references sorted is not too ungainly. 
However, transacting multiple files is not easy. Ask anybody. 
(I think low end solutions solve it by igoring the problem.) 

irondoc target usage @ 
IronDoc could scale huge using techniques like the above. 
But I was aiming for a smaller target in my near term plans. 
I want something scaling from very small to moderately big. 
The idea is to replace feeble tiny ad hoc single file solutions. 
Many folks code a quick and dirty solution that doesn't scale. 

Typically this don't work once the db doesn't fit in memory. 
IronDoc mainly aims to solve this size transition gracefully. 
It lets you code a new hobby project without much design. 
Then it wouldn't suck as it scaled ten thousand times bigger. 
The idea is to jump from 100K files to 1GB near painlessly. 

That's four orders of magnitude, which is good in a first try. 
It ought to work very well for personal user sized databases. 
But as you get ever bigger, the risk factor keeps increasing. 
Obviously the cost of loss goes up the more you have to lose. 
Protection agaist loss starts to cost more than time efficiency. 

Since IronDoc is pre 1.0 vaporware, it's dumb to think 2.0. 
But a later db version could easily be written for 64-bit files. 
It would only involve a mildly ugly port of the 32-bit source. 
Simpler is better in first versions. And I want to cover small. 
(Folks worried about 64-bit versions of Quilt five years ago.) 

Speaking of feeble formats, I mean Microsoft docfile format. 
I'm not denigrating the random feeble formats of all readers. 
(If you want a personal insult, ask first and then get in line.) 
Actually, I don't care much about invading docfile's space. 
I just needed to give you a specific target for your umbrage.
Received on Thursday, 6 December 2001 16:18:56 GMT

This archive was generated by hypermail 2.2.0+W3C-0.50 : Wednesday, 7 November 2012 14:17:15 GMT