Re: Fwd: Looking for the right RDF store(s) from Atanas Kiryakov on 2011-02-25 (semantic-web@w3.org from February 2011)

From: Atanas Kiryakov <naso@sirma.bg>
Date: Fri, 25 Feb 2011 19:40:35 +0200
To: Markus Krötzsch <markus.kroetzsch@comlab.ox.ac.uk>
CC: semantic-web@w3.org, "OWLIM-info@ontotext.com" <OWLIM-info@ontotext.com>
Message-ID: <4D67E993.6020709@sirma.bg>
Dear Markus,

my second round of comments follows inline

>>>> SPARQL Update is mandatory. The following are softer requirements:
>>>>
>>>> (1) Free software version available (crucial to many of our users,
>>>> essential if we are to make it
>>>> our default store),
>>
>> yes, this is SwiftOWLIM, [2]
>
> OK, I hope that Damian's follow-up question on this can be resolved.

As clarified in response to Damian, substantial parts of SwiftOWLIM are freeware, but not open source.

>>>> (2) Robust & reliable for medium to large datasets
>>
>> Can you quantify this requirement a bit?
>> 1M statements, 10M statements?
>> SwiftOWLIM can easily handle 10+ millions of statements within 2GB
>> (32-bit JVM)
>
> That will suffice for a start. Users with very specific requirements will need to look at specific
> systems. We are looking for a default recommendation.

Given more memory (which also means a 64-bit environment), its scalability goes up quite linearly.
While these figures can vary a bit across different datasets and ontologies, given 24GB of RAM one
should be able to deal with 100M explicit triples along with some inferred ones. And the assembly
cost of a desktop machine with 24GB of RAM is in the range of $2000. Thus, one can handle quite
serious volumes of data and query loads with SwiftOWLIM without owning a supercomputer.
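
As a rough back-of-the-envelope check of the "quite linear" claim, the two data points quoted in this thread (10M statements in 2GB, 100M explicit triples in 24GB) can be turned into a memory budget per explicit statement. This is only a sketch: the function name is made up for illustration, 1GB is assumed to mean 2^30 bytes, and the budget includes JVM overhead and inferred statements, so it is not a measured per-triple footprint.

```python
GIB = 2 ** 30  # assumption: "GB" in the figures above means 2^30 bytes


def bytes_per_statement(ram_gib, explicit_statements):
    """Memory budget per explicit statement, including every overhead."""
    return ram_gib * GIB / explicit_statements


# Figures quoted in this thread (rough, dataset-dependent).
small = bytes_per_statement(2, 10_000_000)     # 32-bit JVM case
large = bytes_per_statement(24, 100_000_000)   # 64-bit desktop case

print(f"{small:.0f} bytes/statement at 2GB, {large:.0f} bytes/statement at 24GB")
```

Both budgets land in the same few-hundred-bytes range, which is what one would expect if memory use grows roughly linearly with the number of statements.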

>>>> (3) Good handling of many concurrent queries (of controlled complexity)
>>
>> BigOWLIM deals very well with concurrent queries. Look at section 6.2 of
>> [3]
>>
>> While there are stores which do a bit better on this independent
>> benchmark, BigOWLIM has the best performance on concurrent queries, out
>> of those stores which are able to handle mixes of SPARQL 1.1 updates and
>> regular queries (look at the Explore and Update scenario results in
>> section 6.1.2)
>
> This is nice but we first need a free base system as a default. This can of course be a door-opener
> for other license models, but we need to start with something that people can use without purchasing
> a license. How does your free tool perform on concurrent queries? The amount of parallel requests
> may vary in our case, since we also have some levels of higher level caches that reduce
> re-computation of queries. But it is not uncommon that sudden visits of search engines request many
> pages that are otherwise low in user interest and that are no longer available in any cache.

I cannot give you a number off the top of my head, but I can reconfirm that the design of SwiftOWLIM
allows for good usage of the parallelism of the hardware. We can run a BSBM test on SwiftOWLIM and
provide some data early next week.

>>>> (4) Good handling of continuous updates (some update lag is
>>>> acceptable, but the store should not
>>>> impose update scheduling on the user)
>>
>> Yes, all versions of OWLIM are designed so that they allow for
>> continuous updates, as you define them. Updates become "visible" for
>> read queries shortly after commit of the transaction. Update
>> transactions do not block the evaluation of queries, apart from *very*
>> short periods of locking of index pages
>>
>> BigOWLIM demonstrated that it can handle updates simultaneously with
>> vast amounts of queries at BBC's web site for the World Cup in the
>> summer of 2010. More information about the overall setup is available in
>> [4]. A cluster of a few machines at BBC was able to cope with more than a
>> million SPARQL queries per day, while handling hundreds of updates each
>> hour. BTW, while handling these loads, BigOWLIM was constantly performing
>> reasoning (based on materialisation), which is out of scope for most of
>> the other engines
>
> This is very useful to know. One thing I forgot in my original email is that we use limited
> amounts of reasoning. It would be helpful to have owl:sameAs and class/property hierarchies
> available. I know that OWLIM can handle this, but how does this interfere with incremental updates?

With all versions of OWLIM, reasoning takes place upon transaction commit. So, each time a transaction
is successfully committed, the "deductive closure" is updated wrt the changes in the explicit data.
It doesn't matter whether you add or delete statements. The bottom line is that you configure the
repository to support specific inference semantics (RDFS, OWL Horst, OWL 2 RL and QL, or none) and
you don't have to care about anything else. Reasoning takes place transparently.
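
To illustrate what "the closure is updated on commit" means from the application's point of view, here is a toy sketch. It is not OWLIM's API or algorithm: the `ToyRepository` class and its methods are hypothetical, only two RDFS-style rules are covered, and the closure is naively recomputed from scratch on each commit, whereas a real engine would update it incrementally.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def closure(explicit):
    """Naive forward chaining to a fixpoint over two RDFS-style rules:
    subClassOf transitivity and type inheritance along subClassOf."""
    facts = set(explicit)
    while True:
        new = set()
        for s, p, o in facts:
            for s2, p2, o2 in facts:
                if p2 != SUBCLASS or o != s2:
                    continue
                if p in (SUBCLASS, TYPE):
                    new.add((s, p, o2))
        if new <= facts:
            return facts
        facts |= new


class ToyRepository:
    """Hypothetical store: queries see `visible`, refreshed only on commit."""

    def __init__(self):
        self.explicit = set()
        self.visible = set()
        self._to_add = set()
        self._to_remove = set()

    def add(self, triple):
        self._to_add.add(triple)

    def remove(self, triple):
        self._to_remove.add(triple)

    def commit(self):
        # Apply pending changes, then recompute explicit + inferred statements.
        self.explicit = (self.explicit - self._to_remove) | self._to_add
        self._to_add.clear()
        self._to_remove.clear()
        self.visible = closure(self.explicit)


repo = ToyRepository()
repo.add(("ex:Airport", SUBCLASS, "ex:Facility"))
repo.add(("ex:Facility", SUBCLASS, "ex:Place"))
repo.add(("ex:LHR", TYPE, "ex:Airport"))
repo.commit()   # ("ex:LHR", TYPE, "ex:Place") is now visible as an inferred statement

repo.remove(("ex:Facility", SUBCLASS, "ex:Place"))
repo.commit()   # the inference above is retracted together with its premise
```

The point of the sketch is the contract, not the mechanics: the caller only adds, removes and commits, and both additions and deletions are reflected in the inferred statements after the commit.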

Regarding owl:sameAs, BigOWLIM handles it in an optimized manner to avoid potential negative effects of
massive usage of owl:sameAs and brute force enforcement of its entailment semantics. You can read
more on it here: http://ontotext.com/owlim/owl-sameAs-optimisation.html

SwiftOWLIM would do it by brute force, which for small and middle-sized datasets is not such a problem.
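
To give a feeling for why brute-force owl:sameAs entailment gets expensive, the sketch below counts the statements it produces for one group of equivalent URIs and contrasts that with keeping a single representative per group. The representative-based variant is just one commonly used way to avoid the blow-up; it is an assumption for illustration and not a description of how BigOWLIM implements its optimisation. All function names and example URIs are hypothetical.

```python
from itertools import product


def brute_force_sameas(members):
    """owl:sameAs statements implied by reflexivity, symmetry and
    transitivity inside one equivalence class: n * n of them."""
    return {(a, "owl:sameAs", b) for a, b in product(members, repeat=2)}


def brute_force_copies(triples, members):
    """Replicate each statement for every equivalent subject and object."""
    group = set(members)
    out = set()
    for s, p, o in triples:
        subjects = group if s in group else {s}
        objects = group if o in group else {o}
        out |= {(s2, p, o2) for s2 in subjects for o2 in objects}
    return out


def canonicalise(triples, members):
    """Alternative sketch: rewrite every member to one representative."""
    group = set(members)
    rep = min(group)  # arbitrary but deterministic choice of representative
    return {(rep if s in group else s, p, rep if o in group else o)
            for s, p, o in triples}


members = ["ex:London", "geo:2643743", "db:London"]
data = {("ex:London", "ex:hasAirport", "ex:LHR")}

print(len(brute_force_sameas(members)))        # 9 sameAs statements for 3 URIs
print(len(brute_force_copies(data, members)))  # 3 copies of a single statement
print(len(canonicalise(data, members)))        # 1 statement on the representative
```

With only a handful of equivalent URIs the difference is negligible, which matches the remark that brute force is fine for small and middle-sized datasets; the quadratic growth only hurts when equivalence classes become large.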

>>>> (5) Good support for datatype-related queries (numerical and
>>>> lexicographic sorting, ideally also
>>>> distance queries for geographic data, ideally some forms of string
>>>> pattern matching)
>>
>> Queries involving constraints and ordering wrt literals out of the
>> standard data types are handled smoothly; otherwise OWLIM cannot score
>> well on a benchmark such as BSBM
>
> We would be happy with interval restrictions on number and string data.

At present we do not perform number normalisation or special-purpose indexing. Thus, number
literals are compared based on their string representation. If one needs accurate interval
searches and other constraints on numbers, one should take care to normalise the string
representation of the numbers in the application. Still, performance-wise, such constraints
on literals are fast enough for the usage scenarios we have seen so far.
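
A minimal sketch of such application-side normalisation, assuming non-negative numbers and a fixed, application-chosen width: numbers are written as zero-padded, fixed-point strings so that string order coincides with numeric order. The function name and the `WIDTH` and `SCALE` values are illustrative assumptions, not anything prescribed by OWLIM.

```python
from decimal import Decimal

WIDTH = 12  # assumed maximum number of integer digits
SCALE = 4   # assumed number of fractional digits


def normalise(value, width=WIDTH, scale=SCALE):
    """Return a fixed-width string for a non-negative number so that
    comparing the strings gives the same result as comparing the numbers."""
    d = Decimal(str(value))
    if d < 0:
        raise ValueError("this sketch only covers non-negative numbers")
    # Round to exactly `scale` fractional digits, e.g. 9 -> 9.0000
    text = f"{d.quantize(Decimal(1).scaleb(-scale)):f}"
    int_part, _, frac_part = text.partition(".")
    if len(int_part) > width:
        raise ValueError("number too large for the chosen width")
    return int_part.zfill(width) + "." + frac_part


raw = ["9", "10", "100.5", "25"]

print(sorted(raw))                 # string order: ['10', '100.5', '25', '9']
print(sorted(raw, key=normalise))  # numeric order: ['9', '10', '25', '100.5']

# An interval restriction then becomes a plain string comparison:
low, high = normalise(10), normalise(50)
in_range = [n for n in raw if low <= normalise(n) <= high]   # ['10', '25']
```

Negative numbers and values of very different magnitude would need a more elaborate encoding; the sketch deliberately leaves those out.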


>> BigOWLIM does offer geo-spatial indexing and queries, as described in
>> [5]. There are also several modalities of integrated FTS, [6]
>
> It is good to see that various RDF store vendors are working on geo support. In most applications,
> we could also live with bounding-box-type matching as long as numeric ranges can be queried. So this
> is not a hard requirement.

Well, this depends on the size of the dataset you deal with. Getting from Geonames the
airports within 50 miles of London without a geo-spatial index can be 500 times slower than when
such an index is used, as presented here: http://ontotext.com/owlim/geo.html
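
For reference, this is roughly what a "nearby" query has to do when no geo-spatial index is available: scan all candidates, pre-filter them with a bounding box expressed as numeric ranges, and compute the exact distance for the survivors. The sketch is generic and not tied to OWLIM; the function names are hypothetical, the coordinates are approximate values used purely for illustration, and the bounding box is a simple approximation that should be padded in real use.

```python
from math import asin, cos, degrees, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in miles."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi = p2 - p1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


def bounding_box(lat, lon, radius_miles):
    """Approximate lat/lon ranges enclosing a circle (fine away from the poles)."""
    dlat = degrees(radius_miles / EARTH_RADIUS_MILES)
    dlon = dlat / cos(radians(lat))
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon


def within(points, lat, lon, radius_miles):
    """Linear scan: cheap range check first, exact distance second."""
    lat_min, lat_max, lon_min, lon_max = bounding_box(lat, lon, radius_miles)
    hits = []
    for name, (plat, plon) in points.items():
        if not (lat_min <= plat <= lat_max and lon_min <= plon <= lon_max):
            continue
        if haversine_miles(lat, lon, plat, plon) <= radius_miles:
            hits.append(name)
    return sorted(hits)


# Approximate coordinates, for illustration only.
AIRPORTS = {
    "Heathrow": (51.4700, -0.4543),
    "Gatwick": (51.1537, -0.1821),
    "Manchester": (53.3537, -2.2750),
}
LONDON = (51.5074, -0.1278)

print(within(AIRPORTS, *LONDON, 50))  # ['Gatwick', 'Heathrow']
```

The scan touches every candidate once, so its cost grows with the size of the dataset, which is why the presence of an index matters more and more as the data grows.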


Cheers
Naso


----------------------------------------------------------
Atanas Kiryakov
Executive Director
Ontotext AD, http://www.ontotext.com
Phone: (+359 2) 974 61 44; Fax: 975 3226
----------------------------------------------------------
Received on Friday, 25 February 2011 17:41:10 UTC