Re: Fwd: Looking for the right RDF store(s)

Dear Markus,

>> Some installations of Semantic MediaWiki have become quite big, and our users are looking for
>> faster storage backends (than MySQL!) for query answering. We will provide RDF store bindings via
>> SPARQL Update, but we are unsure which RDF stores to recommend to our users. Input is appreciated
>> (also let me know if I should rather take this to a more specific list).

I believe OWLIM [1] can do the job for SMW. All editions are pure Java implementations. They can be 
integrated through either Sesame or Jena without loss of functionality or performance, so 
integration and portability should not be an issue.
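
To give an idea of how light the Sesame integration is, here is a minimal sketch that queries an 
OWLIM repository exposed through a Sesame HTTP server. The server URL and the repository ID "smw" 
are hypothetical placeholders for whatever your setup uses:

    import org.openrdf.query.BindingSet;
    import org.openrdf.query.QueryLanguage;
    import org.openrdf.query.TupleQueryResult;
    import org.openrdf.repository.Repository;
    import org.openrdf.repository.RepositoryConnection;
    import org.openrdf.repository.http.HTTPRepository;

    public class OwlimQueryDemo {
        public static void main(String[] args) throws Exception {
            // Connect to a repository named "smw" on a local Sesame server;
            // both the URL and the repository ID are hypothetical placeholders.
            Repository repo = new HTTPRepository("http://localhost:8080/openrdf-sesame", "smw");
            repo.initialize();
            RepositoryConnection con = repo.getConnection();
            try {
                TupleQueryResult result = con.prepareTupleQuery(
                        QueryLanguage.SPARQL,
                        "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10").evaluate();
                while (result.hasNext()) {
                    BindingSet bs = result.next();
                    System.out.println(bs.getValue("s") + " " + bs.getValue("p")
                            + " " + bs.getValue("o"));
                }
                result.close();
            } finally {
                con.close();
                repo.shutDown();
            }
        }
    }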

>> SPARQL Update is mandatory. The following are softer requirements:
>>
>> (1) Free software version available (crucial to many of our users, essential if we are to make it
>> our default store),

Yes, this is SwiftOWLIM [2].

>> (2) Robust & reliable for medium to large datasets

Can you quantify this requirement a bit? 1M statements? 10M statements?
SwiftOWLIM can easily handle 10+ million statements within 2 GB of RAM (on a 32-bit JVM).

>> (3) Good handling of many concurrent queries (of controlled complexity)

BigOWLIM deals very well with concurrent queries; see section 6.2 of [3].

While some stores do a bit better on this independent benchmark, BigOWLIM has the best performance 
on concurrent queries among those stores that are able to handle mixes of SPARQL 1.1 updates and 
regular queries (see the Explore and Update scenario results in section 6.1.2).

>> (4) Good handling of continuous updates (some update lag is acceptable, but the store should not
>> impose update scheduling on the user)

Yes, all versions of OWLIM are designed to allow for continuous updates, as you define them. 
Updates become "visible" to read queries shortly after the transaction commits. Update 
transactions do not block query evaluation, apart from *very* short periods during which index 
pages are locked.
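
As a sketch of what this looks like from the client side, here is an update transaction via the 
Sesame API, again against the hypothetical endpoint from the sketch above and with made-up example 
URIs:

    import org.openrdf.model.ValueFactory;
    import org.openrdf.repository.Repository;
    import org.openrdf.repository.RepositoryConnection;
    import org.openrdf.repository.http.HTTPRepository;

    public class OwlimUpdateDemo {
        public static void main(String[] args) throws Exception {
            // Same hypothetical server URL and repository ID as in the query sketch
            Repository repo = new HTTPRepository("http://localhost:8080/openrdf-sesame", "smw");
            repo.initialize();
            ValueFactory vf = repo.getValueFactory();
            RepositoryConnection con = repo.getConnection();
            try {
                con.setAutoCommit(false); // group several changes into one transaction
                // Hypothetical example statement: a wiki page's modification date
                con.add(vf.createURI("http://example.org/Page_1"),
                        vf.createURI("http://example.org/modificationDate"),
                        vf.createLiteral("2011-02-25",
                                vf.createURI("http://www.w3.org/2001/XMLSchema#date")));
                con.commit(); // the statement becomes visible to readers at this point
            } finally {
                con.close();
                repo.shutDown();
            }
        }
    }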

BigOWLIM demonstrated that it can handle updates simultaneously with large volumes of queries on 
BBC's web site for the World Cup in the summer of 2010. More information about the overall setup is 
available in [4]. A cluster of a few machines at the BBC was able to cope with more than a million 
SPARQL queries per day, while handling hundreds of updates each hour. BTW, while handling these 
loads, BigOWLIM was constantly performing reasoning (based on materialisation), which is out of 
scope for most of the other engines.

>> (5) Good support for datatype-related queries (numerical and lexicographic sorting, ideally also
>> distance queries for geographic data, ideally some forms of string pattern matching)

Queries involving constraints and ordering with respect to literals of the standard datatypes are 
handled smoothly; otherwise OWLIM could not score well on a benchmark such as BSBM.

BigOWLIM does offer geo-spatial indexing and queries, as described in [5]. There are also several 
modes of integrated full-text search (FTS) [6].
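
For illustration, a datatype-aware query of the kind SMW might generate, with a numeric range 
filter and descending sort. This is a sketch against the same hypothetical endpoint as above; the 
ex:population property is made up:

    import org.openrdf.query.QueryLanguage;
    import org.openrdf.query.TupleQueryResult;
    import org.openrdf.repository.Repository;
    import org.openrdf.repository.RepositoryConnection;
    import org.openrdf.repository.http.HTTPRepository;

    public class DatatypeQueryDemo {
        // Numeric range filter plus descending sort; ex:population is a made-up property
        private static final String QUERY =
            "PREFIX xsd: <http://www.w3.org/2001/XMLSchema#> "
            + "PREFIX ex: <http://example.org/> "
            + "SELECT ?page ?pop WHERE { ?page ex:population ?pop . "
            + "FILTER(?pop > \"100000\"^^xsd:integer) } "
            + "ORDER BY DESC(?pop) LIMIT 100";

        public static void main(String[] args) throws Exception {
            Repository repo = new HTTPRepository("http://localhost:8080/openrdf-sesame", "smw");
            repo.initialize();
            RepositoryConnection con = repo.getConnection();
            try {
                TupleQueryResult result =
                        con.prepareTupleQuery(QueryLanguage.SPARQL, QUERY).evaluate();
                while (result.hasNext()) {
                    System.out.println(result.next()); // each row binds ?page and ?pop
                }
                result.close();
            } finally {
                con.close();
                repo.shutDown();
            }
        }
    }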


>> (6) Options for scaling out; but without being obliged to start with a multi-node server cluster.

Yes, BigOWLIM has a Replication Cluster edition, presented in [7]. In recent tests against BSBM 
100M, a cluster of 100 Amazon EC2 instances scored 200,000 BSBM QMpH [8]; since a BSBM query mix 
comprises 25 queries, this amounts to 5M queries/hour. The total Amazon EC2 cost of the evaluation 
worked out to $1 per 100,000 SPARQL queries!

>> I am aware that the above requirements are not specific -- indeed the details vary widely across
>> our applications (from 10 users and millions of pages to tenth of thousands of users and pages).
>> Single-user query performance and top speed for highly complex queries are not of much interest,
>> but robustness and reliability is. We consider some applications in the order of DBPedia En but
>> this is not typical. But cases with some 10 Mio triples should be covered. Of course,
>> well-equipped servers (RAID, SSD-based storage, loads of RAM, etc.) can be assumed.

10M triples can be handled comfortably by SwiftOWLIM in memory, given a machine with 2-4 GB of RAM.

>> What we are looking for are good candidates to recommend in general, knowing that users will still
>> need to pick the optimal solution for their individual data sets. What we can offer to RDF store
>> suppliers is significant visibility in various user communities (e.g., our biggest web user is
>> Wikia, hosting about 30.000 wiki communities; SMW users in industry would also appreciate more
>> sophisticated storage solutions).

Should you have any questions, please do not hesitate to contact us.

Best regards,
Naso

[1] OWLIM's home page. http://ontotext.com/owlim/index.html
[2] OWLIM Editions. http://ontotext.com/owlim/version-map.html
[3] BSBM V3 Results (February 2011). 
http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/V6/index.html#comparison
[4] BBC World Cup 2010 dynamic semantic publishing. 
http://www.bbc.co.uk/blogs/bbcinternet/2010/07/bbc_world_cup_2010_dynamic_sem.html
[5] Geo-spatial indexing in OWLIM. http://ontotext.com/owlim/geo.html
[6] BigOWLIM User Guide. http://ontotext.com/owlim/BigOWLIM_user_guide_v3.4.pdf
[7] BigOWLIM Cluster Configuration. http://ontotext.com/owlim/cluster-configuration.html
[8] OWLIM Performance in the Amazon Cloud. http://ontotext.com/owlim/cloud-performance.html

----------------------------------------------------------
Atanas Kiryakov
Executive Director
Ontotext AD, http://www.ontotext.com
Phone: (+359 2) 974 61 44; Fax: 975 3226
----------------------------------------------------------

Received on Friday, 25 February 2011 10:58:51 UTC