Re: RDF tools as workhorse from Jeen Broekstra on 2005-09-14 (semantic-web@w3.org from September 2005)

From: Jeen Broekstra <jeen@aduna.biz>
Date: Wed, 14 Sep 2005 10:12:01 +0200
To: Mailing Lists <list@thirdstation.com>
CC: semantic-web@w3.org
Message-ID: <4327DB51.2030101@aduna.biz>

Mailing Lists wrote:

> Does anyone on the list have some real-world stories to share about 
> using RDF and its tools as a backend technology?  The company I work for 
> maintains a database of metadata.  I'd like to explore using RDF instead 
> of our current schemas.

Aduna (http://aduna.biz/) uses the Sesame RDF store 
(http://www.openrdf.org/) as backend technology in its products: 
AutoFocus, Metadata Server and Spectacle. These are all tools for 
navigating/browsing/searching/visualizing large amounts of information 
(typically things like corporate intranets or databases).

> For example:   I have a lot of data about books.  I'd like to translate 
> the data into RDF/XML and dump it into an RDF database.  Then, taking a 
> particular book, I'd like to query the database to extract related 
> information like: other books by the same author, other books with the 
> same subject code, etc.
> 
> My concerns relate to:
> 1) Performance -- Right now we query the database using SQL.  Sometimes 
> it is _very_ slow.  That's mainly because the data is distributed across 
> tables and there are a lot of joins going on.  It seems like using RDF 
> would allow us to use simple queries.

Partly true. However, the performance you get from a store depends on 
its backend implementation, and different query patterns can greatly 
influence it.

For example in Sesame, there are three different backends: in-memory 
(with file-dump for persistence), native on-disk storage, and RDBMS 
storage (MySQL/PostgreSQL/Oracle). Each has its own strong points and 
limitations.

The in-memory store is the fastest but of course has limits to its 
scalability (we observed that it uses roughly 170 bytes of RAM per 
triple on a 64-bit architecture). The native store is the next-fastest 
but is currently also limited in scalability, mainly because we have 
not yet had enough time to develop the indexing strategy further. This 
also means that it is very fast for some types of query patterns but 
can be extremely slow for others (those for which there is no index). 
The RDBMS backend is typically the most scalable but also the slowest, 
both in adding and in querying: queries with long path expressions tend 
to be translated to multi-join SQL queries, which are quite expensive. 
Of course, a lot here depends on the configuration of your RDBMS.
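
To illustrate why long path expressions get expensive on a relational 
backend, here is a minimal sketch in Python using sqlite3. It assumes a 
single hypothetical triples(s, p, o) table, which is NOT Sesame's actual 
RDBMS schema, and the ex: names are made up. The only point is that each 
extra step in the path adds another self-join.

```python
import sqlite3

# Hypothetical single-table layout; NOT Sesame's real RDBMS schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("ex:book1", "ex:author", "ex:alice"),
        ("ex:book2", "ex:author", "ex:alice"),
        ("ex:book3", "ex:author", "ex:bob"),
        ("ex:book1", "ex:title", "RDF Basics"),
        ("ex:book2", "ex:title", "Graph Stores"),
        ("ex:book3", "ex:title", "SQL Tuning"),
    ],
)

# Path: book1 -author-> ?a <-author- ?other -title-> ?title
# Three triple patterns become a three-way self-join on the same table.
sql = """
SELECT t3.o
FROM triples t1
JOIN triples t2 ON t2.o = t1.o AND t2.p = 'ex:author'
JOIN triples t3 ON t3.s = t2.s AND t3.p = 'ex:title'
WHERE t1.s = ? AND t1.p = 'ex:author' AND t2.s <> t1.s
ORDER BY t3.o
"""
other_titles = [row[0] for row in conn.execute(sql, ("ex:book1",))]
print(other_titles)
```

A longer path would simply add more JOIN clauses, which is where the 
cost comes from.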

> 2) Scalability -- Our triplestore would be HUGE.  I'd estimate 10-20 
> Million triples.  Is that small or large in RDF circles?

It's a lot, but I'd say most serious triple stores can handle this. 
Sesame certainly can, and I expect Kowari can as well.
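
As a rough back-of-the-envelope check, the figure of about 170 bytes per 
triple observed for the in-memory store can be scaled to your estimate. 
This is a sketch only: it assumes the per-triple cost stays constant as 
the store grows, which may not hold in practice, and the function name 
is just for illustration.

```python
BYTES_PER_TRIPLE = 170  # observed for the in-memory store on 64-bit

def estimated_ram_gb(triples: int, bytes_per_triple: int = BYTES_PER_TRIPLE) -> float:
    """Estimated RAM in gigabytes (10**9 bytes), assuming linear scaling."""
    return triples * bytes_per_triple / 1e9

for n in (10_000_000, 20_000_000):
    print(f"{n:,} triples: about {estimated_ram_gb(n):.1f} GB")
```

Under that assumption, 10-20 million triples come to roughly 1.7 to 
3.4 GB of RAM for the in-memory store.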

> 3) Productivity -- It's usually easier for me to envision creating RDF 
> from our source data than massaging the data to fit into our database 
> schema.  The same goes for when I'm extracting data - it seems like it 
> would be much easier to express my query as a triple using wildcards for 
> the data I want.

I think you've hit on one of the main benefits of RDF, in fact one of 
the main reasons our company chose to use RDF as the backend technology 
for its tooling: flexibility. The open model allows one to flexibly 
adapt the representation of the data, plus you can conceivably map any 
kind of source to an RDF graph relatively easily.
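
To make the "triple with wildcards" idea concrete, here is a minimal 
sketch in plain Python. It is not the API of Sesame or of any other 
store; the data, the ex: names, the subject code and the helper 
functions match() and related() are all hypothetical. None plays the 
role of the wildcard.

```python
# Hypothetical book data as (subject, predicate, object) tuples.
triples = [
    ("ex:book1", "ex:author", "ex:alice"),
    ("ex:book2", "ex:author", "ex:alice"),
    ("ex:book3", "ex:author", "ex:bob"),
    ("ex:book1", "ex:subject", "COM021"),
    ("ex:book3", "ex:subject", "COM021"),
]

def match(s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

def related(book, predicate):
    """Other subjects sharing any value of `predicate` with `book`."""
    values = {o for _, _, o in match(s=book, p=predicate)}
    return sorted(
        {s for s, _, o in match(p=predicate) if o in values and s != book}
    )

print(related("ex:book1", "ex:author"))   # other books by the same author
print(related("ex:book1", "ex:subject"))  # other books with the same subject code
```

A real store would answer such patterns from its indexes rather than by 
scanning a list, but the shape of the query is the same.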

> Any information will be helpful.  I'm interested in learning from other 
> peoples' experiences.

It takes time to get this right. Most triple stores are new and 
therefore not yet so robust that everything works out of the box for 
your situation. But with some patience and a willingness to make an 
initial investment in setting up your architecture right, you can gain 
a lot in terms of flexibility later on IMHO.

What we have seen in our products (we switched from an internal 
dedicated storage format to the more generic RDF framework) is an 
initial sharp drop in the performance of the tools, followed by a 
gradual recovery as we understood the problems better and fixed them 
one by one. I'd say that if your primary concern is query performance 
then perhaps using a triple store is not the way to go, but if 
flexibility, ease of adaptation, ease of interoperation with other tools 
(etc.) also play a significant role, it may certainly be worth the cost.

HTH,

Jeen
Received on Wednesday, 14 September 2005 08:12:46 UTC