Re: Tools for 20 million triples?

Charles McCathieNevile wrote:
> Hi folks,
> 
> on another list someone asked what tools would be good for handling an
> OWL ontology of about 25,000 terms, with around 20 million triples. There
> were a
> handful of ideas about how to build specialised SQL systems or similar, but
> Danny Ayers pointed out that there are systems capable of handling RDF and a
> lot of triples (which by lucky chance happens to be a way of storing OWL).
> 
> So I wondered if anyone on this list had experience of tools working with
> this size dataset. (I will read Dave Beckett's report done for SWAD-Europe on
> the topic, but I suspect that there is already new information available, and
> would like to be up to date).
> 
> Cheers
> 
> Chaals
> 

I would be remiss in my duties not to mention our Java triple stores, 
Kowari and TKS.  Our current single-system configuration has been tested 
to around 215 million triples, so that gives you plenty of room to grow. 
The iTQL query layer in TKS can also query multiple data sources at 
once, so you could scale up that way too.
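For anyone who hasn't seen iTQL, here's a rough sketch of what a query 
across two models might look like from Java.  The bean class and method 
names (ItqlInterpreterBean, executeQueryToString) and the model URIs are 
assumptions from memory rather than a definitive example, so check the 
Kowari/TKS documentation for the exact API:

  import org.kowari.itql.ItqlInterpreterBean;

  public class MultiModelQuery {
    public static void main(String[] args) throws Exception {
      // Bean that talks to a running Kowari/TKS server
      ItqlInterpreterBean itql = new ItqlInterpreterBean();

      // "or" in the FROM clause unions two models so they are
      // queried together; server and model names here are made up.
      String query =
          "select $s $p $o "
        + "from <rmi://host1/server1#ontology> "
        + "  or <rmi://host2/server1#instances> "
        + "where $s $p $o ;";

      // Results returned as a string for simplicity
      System.out.println(itql.executeQueryToString(query));
      itql.close();
    }
  }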

The currently available CVS version of Kowari 
(http://sf.net/projects/kowari) can load 20 million triples in about 1 
hour 10 minutes on an Opteron 240 (1.4 GHz).  We use memory-mapped I/O 
on 64-bit systems such as Opterons or Sun hardware.
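For a feel of how a load like that is kicked off, the iTQL create/load 
commands go roughly as below.  Again, the bean class, the executeUpdate 
method and the URIs are placeholders and assumptions on my part, not 
gospel:

  import org.kowari.itql.ItqlInterpreterBean;

  public class BulkLoad {
    public static void main(String[] args) throws Exception {
      ItqlInterpreterBean itql = new ItqlInterpreterBean();

      // Create a model and load an RDF/XML file into it; the server
      // name, model name and file path are all placeholders.
      itql.executeUpdate(
          "create <rmi://localhost/server1#data> ;");
      itql.executeUpdate(
          "load <file:/data/ontology.rdf> "
        + "into <rmi://localhost/server1#data> ;");

      itql.close();
    }
  }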

For 32-bit systems (such as a Pentium 4) there are some limitations 
we've been working on.  With mapped I/O you soon hit a limit, at about 
3-4 million triples, while explicit I/O only loads at about 800 
triples/second.  Explicit I/O, though, has no practical limit (other 
than time and the range of a 64-bit long) on the number of triples you 
can add.

The current CVS version also has Jena and RDQL support, plus JRDF 
interfaces.
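As a sketch of the Jena/RDQL side, an RDQL query over a Jena 2 Model 
looks roughly like this (written from memory of the Jena 2 API; getting 
a Model that is actually backed by Kowari goes through Kowari's own 
model factory, which I'm leaving out here):

  import com.hp.hpl.jena.rdf.model.Model;
  import com.hp.hpl.jena.rdf.model.ModelFactory;
  import com.hp.hpl.jena.rdql.Query;
  import com.hp.hpl.jena.rdql.QueryEngine;
  import com.hp.hpl.jena.rdql.QueryResults;
  import com.hp.hpl.jena.rdql.ResultBinding;

  public class RdqlExample {
    public static void main(String[] args) {
      // Plain in-memory Jena model for illustration; a Kowari-backed
      // Model would come from Kowari's Jena support instead.
      Model model = ModelFactory.createDefaultModel();
      model.read("file:data.rdf");

      // RDQL: every statement in the model
      Query query = new Query("SELECT ?s, ?p, ?o WHERE (?s, ?p, ?o)");
      query.setSource(model);

      QueryResults results = new QueryEngine(query).exec();
      while (results.hasNext()) {
        ResultBinding b = (ResultBinding) results.next();
        System.out.println(b.get("s") + " " + b.get("p")
            + " " + b.get("o"));
      }
      results.close();
    }
  }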

Our current internal development version loads 20 million triples in an 
hour on the same Opteron system.  On that system we're loading 200 
million triples at a rate of 2,100 triples/second.  The 32-bit load path 
is now around 3 times faster, at about 2,500 triples/second, which would 
load your data in around 2 hours (20 million / 2,500 triples/second is 
roughly 8,000 seconds).

We've also got some other changes in the pipeline that may give us 
further speed improvements, especially over large data sets.

We're now seeing a nice mix of I/O-bound and CPU-bound behaviour, along 
with being bound by things outside our system, such as the ARP parser.

We plan to release that in the next few weeks or so.

Received on Thursday, 25 March 2004 18:36:05 UTC