Re: Fwd: Looking for the right RDF store(s)

On 25/02/2011 10:58, Atanas Kiryakov wrote:
> Dear Markus,
>
>>> Some installations of Semantic MediaWiki have become quite big, and
>>> our users are looking for
>>> faster storage backends (than MySQL!) for query answering. We will
>>> provide RDF store bindings via
>>> SPARQL Update, but we are unsure which RDF stores to recommend to our
>>> users. Input is appreciated
>>> (also let me know if I should rather take this to a more specific list).
>
> I believe OWLIM [1] can do the job for SMW. All editions are pure Java
> implementations. They can be integrated through either Sesame or Jena
> without loss of functionality or performance. Thus integration and
> portability should not be an issue

This is not a strict requirement, but it can be convenient in some 
cases, in particular since some users already work with Jena or Sesame.
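
To make this concrete: when a wiki page is saved, SMW would send a 
plain SPARQL Update request of roughly the following form (all IRIs 
below are illustrative, not the actual SMW vocabulary):

   PREFIX wiki: <http://example.org/wiki/>
   PREFIX prop: <http://example.org/wiki/Property:>
   PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>

   # Hypothetical update for a re-saved page "Berlin": first drop all
   # old statements about the page, then insert the new ones.
   DELETE WHERE { wiki:Berlin ?p ?o } ;
   INSERT DATA {
     wiki:Berlin prop:Population "3450889"^^xsd:integer ;
                 prop:Located_in wiki:Germany .
   }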

>
>>> SPARQL Update is mandatory. The following are softer requirements:
>>>
>>> (1) Free software version available (crucial to many of our users,
>>> essential if we are to make it
>>> our default store),
>
> yes, this is SwiftOWLIM [2]

OK, I hope that Damian's follow-up question on this can be resolved.

>
>>> (2) Robust & reliable for medium to large datasets
>
> Can you quantify this requirement a bit?
> 1M statements, 10M statements?
> SwiftOWLIM can easily handle 10+ million statements within 2GB
> (32-bit JVM)

That will suffice for a start. Users with very specific requirements 
will need to look at specific systems. We are looking for a default 
recommendation.

>
>>> (3) Good handling of many concurrent queries (of controlled complexity)
>
> BigOWLIM deals very well with concurrent queries. Look at section 6.2 of
> [3]
>
> While there are stores which do a bit better on this independent
> benchmark, BigOWLIM has the best performance on concurrent queries, out
> of those stores which are able to handle mixes of SPARQL 1.1 updates and
> regular queries (look at the Explore and Update scenario results in
> section 6.1.2)

This is nice, but we first need a free base system as a default. This 
can of course be a door-opener for other license models, but we need to 
start with something that people can use without purchasing a license. 
How does your free tool perform on concurrent queries? The number of 
parallel requests may vary in our case, since we also have several 
layers of higher-level caching that reduce the re-computation of 
queries. But it is not uncommon that a sudden visit from a search 
engine crawler requests many pages that otherwise attract little user 
interest and are no longer available in any cache.

>
>>> (4) Good handling of continuous updates (some update lag is
>>> acceptable, but the store should not
>>> impose update scheduling on the user)
>
> Yes, all versions of OWLIM are designed so that they allow for
> continuous updates, as you define them. Updates become "visible" for
> read queries shortly after commit of the transaction. Update
> transactions do not block the evaluation of queries, apart from *very*
> short periods of locking of index pages
>
> BigOWLIM demonstrated that it can handle updates simultaneously with
> vast numbers of queries at BBC's web site for the World Cup in the
> summer of 2010. More information about the overall setup is available
> in [4]. A cluster of a few machines at BBC was able to cope with more
> than a million SPARQL queries per day, while handling hundreds of
> updates each hour. BTW, while handling these loads, BigOWLIM was
> constantly performing reasoning (based on materialisation), which is
> out of scope for most of the other engines

This is very useful to know. One thing I forgot to mention in my 
original email is that we use limited amounts of reasoning. It would be 
helpful to have owl:sameAs and class/property hierarchies available. I 
know that OWLIM can handle this, but how does this interfere with 
incremental updates?
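
To illustrate the kind of inference we rely on: a typical SMW request 
asks for all members of a category, and we expect members of its 
subcategories to be included as well. Roughly (the vocabulary is again 
illustrative):

   PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
   PREFIX cat: <http://example.org/wiki/Category:>

   # If cat:Capital is declared an rdfs:subClassOf cat:City, we expect
   # instances of cat:Capital among the answers. The open question is
   # what happens to such inferred answers when a subclass assertion
   # is later retracted by an incremental update.
   SELECT ?city WHERE { ?city rdf:type cat:City }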

>
>>> (5) Good support for datatype-related queries (numerical and
>>> lexicographic sorting, ideally also
>>> distance queries for geographic data, ideally some forms of string
>>> pattern matching)
>
> Queries involving constraints and ordering with respect to literals of
> the standard data types are handled smoothly; otherwise OWLIM could
> not score well on a benchmark such as BSBM

We would be happy with interval restrictions on number and string data.
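
For example, queries of the following form should be handled 
efficiently (the property names are illustrative):

   PREFIX prop: <http://example.org/wiki/Property:>
   PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>

   # Hypothetical query combining a numeric interval restriction with
   # a lexicographic one, plus numerical sorting.
   SELECT ?page ?pop WHERE {
     ?page prop:Population ?pop ;
           prop:Name ?name .
     FILTER ( ?pop >= "100000"^^xsd:integer &&
              ?pop <  "1000000"^^xsd:integer )
     FILTER ( STR(?name) >= "A" && STR(?name) < "M" )
   }
   ORDER BY ?pop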

>
> BigOWLIM does offer geo-spatial indexing and queries, as described in
> [5]. There are also several modalities of integrated full-text search
> (FTS) [6]

It is good to see that various RDF store vendors are working on geo 
support. In most applications, we could also live with bounding-box-type 
matching as long as numeric ranges can be queried. So this is not a hard 
requirement.
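
Concretely, a bounding-box search that gets by with plain numeric range 
filters could look like this (using the W3C Basic Geo vocabulary; the 
coordinates roughly cover Greater London):

   PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

   # No dedicated spatial index is required for this, although one
   # would of course speed it up considerably.
   SELECT ?place WHERE {
     ?place geo:lat  ?lat ;
            geo:long ?long .
     FILTER ( ?lat  >= 51.28 && ?lat  <= 51.69 &&
              ?long >= -0.51 && ?long <=  0.33 )
   }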

<snip>

>
> 10M triples can be handled in comfort by SwiftOWLIM in memory, given a
> machine with 2-4GB of RAM

This will suffice in many of our applications. We currently tend to 
cover the long tail of semantic data management, i.e. a large number of 
sites with small to medium amounts of data on each. Ironically, these 
sites can be more vulnerable to concurrent query bursts, since they 
have less powerful servers and under-developed caching mechanisms.

>
>>> What we are looking for are good candidates to recommend in general,
>>> knowing that users will still
>>> need to pick the optimal solution for their individual data sets.
>>> What we can offer to RDF store
>>> suppliers is significant visibility in various user communities
>>> (e.g., our biggest web user is
>>> Wikia, hosting about 30,000 wiki communities; SMW users in industry
>>> would also appreciate more
>>> sophisticated storage solutions).
>
> Should you have any questions, please do not hesitate to contact us

Thanks, I will probably have more questions when we start concrete 
testing. For now, my main open question is the one about the freeness 
of SwiftOWLIM.

Best regards,

Markus



-- 
Dr. Markus Krötzsch
Oxford  University  Computing  Laboratory
Room 306, Parks Road, Oxford, OX1 3QD, UK
+44 (0)1865 283529    http://korrekt.org/

Received on Friday, 25 February 2011 12:32:47 UTC