- From: Atanas Kiryakov <naso@sirma.bg>
- Date: Fri, 25 Feb 2011 19:40:35 +0200
- To: Markus Krötzsch <markus.kroetzsch@comlab.ox.ac.uk>
- CC: semantic-web@w3.org, "OWLIM-info@ontotext.com" <OWLIM-info@ontotext.com>
Dear Markus,

my second round of comments follows inline:

>>>> SPARQL Update is mandatory. The following are softer requirements:
>>>>
>>>> (1) Free software version available (crucial to many of our users,
>>>> essential if we are to make it our default store),
>>
>> yes, this is SwiftOWLIM, [2]
>
> OK, I hope that Damian's follow-up question on this can be resolved.

As clarified in response to Damian, substantial parts of SwiftOWLIM are
freeware, but not open source.

>>>> (2) Robust & reliable for medium to large datasets
>>
>> Can you quantify this requirement a bit?
>> 1M statements, 10M statements?
>> SwiftOWLIM can easily handle 10+ million statements within 2GB
>> (32-bit JVM)
>
> That will suffice for a start. Users with very specific requirements
> will need to look at specific systems. We are looking for a default
> recommendation.

Given more memory (which also means a 64-bit environment), its scalability
goes up quite linearly. While these figures can vary a bit across different
datasets and ontologies, given 24GB of RAM one should be able to deal with
100M explicit triples, along with some inferred ones. And the assembly cost
of a desktop machine with 24GB of RAM is in the range of $2000. Thus, one
can handle quite serious volumes of data and query loads with SwiftOWLIM
without owning a supercomputer.

>>>> (3) Good handling of many concurrent queries (of controlled complexity)
>>
>> BigOWLIM deals very well with concurrent queries. Look at section 6.2
>> of [3]
>>
>> While there are stores which do a bit better on this independent
>> benchmark, BigOWLIM has the best performance on concurrent queries, out
>> of those stores which are able to handle mixes of SPARQL 1.1 updates and
>> regular queries (look at the Explore and Update scenario results in
>> section 6.1.2)
>
> This is nice but we first need a free base system as a default. This can
> of course be a door-opener for other license models, but we need to start
> with something that people can use without purchasing a license. How does
> your free tool perform on concurrent queries? The number of parallel
> requests may vary in our case, since we also have several levels of
> higher-level caches that reduce re-computation of queries. But it is not
> uncommon that sudden visits of search engines request many pages that are
> otherwise low in user interest and that are no longer available in any
> cache.

I cannot give you a number off the top of my head, but I can reconfirm
that the design of SwiftOWLIM allows for good usage of the parallelism of
the hardware. We can run a BSBM test on SwiftOWLIM and provide some data
early next week.

>>>> (4) Good handling of continuous updates (some update lag is
>>>> acceptable, but the store should not impose update scheduling on
>>>> the user)
>>
>> Yes, all versions of OWLIM are designed so that they allow for
>> continuous updates, as you define them. Updates become "visible" for
>> read queries shortly after commit of the transaction. Update
>> transactions do not block the evaluation of queries, apart from *very*
>> short periods of locking of index pages
>>
>> BigOWLIM demonstrated that it can handle updates simultaneously with
>> vast amounts of queries at BBC's web site for the World Cup in the
>> summer of 2010. More information about the overall setup is available
>> in [4]. A cluster of a few machines at BBC was able to cope with more
>> than a million SPARQL queries per day, while handling hundreds of
>> updates each hour.

BTW, while handling these loads, BigOWLIM was constantly performing
reasoning (based on materialisation), which is out of scope for most of
the other engines.

> This is very useful to know. One thing I forgot in my original email is
> that we use limited amounts of reasoning. It would be helpful to have
> owl:sameAs and class/property hierarchies available. I know that OWLIM
> can handle this, but how does this interfere with incremental updates?

With all versions of OWLIM, reasoning takes place upon transaction commit.
So, each time a transaction is successfully committed, the "deductive
closure" is updated with respect to the changes in the explicit data. It
does not matter whether you add or delete statements. The baseline is that
you configure the repository to support specific inference semantics (RDFS,
OWL Horst, OWL 2 RL and QL, or none) and you don't have to care about
anything else. Reasoning takes place transparently.

Regarding owl:sameAs, BigOWLIM handles it in an optimised manner, to avoid
the potential negative effects of massive usage of owl:sameAs and
brute-force enforcement of its entailment semantics. You can read more on
it here: http://ontotext.com/owlim/owl-sameAs-optimisation.html
SwiftOWLIM would do it by brute force, which for small and middle-sized
datasets is not such a problem.

>>>> (5) Good support for datatype-related queries (numerical and
>>>> lexicographic sorting, ideally also distance queries for geographic
>>>> data, ideally some forms of string pattern matching)
>>
>> Queries involving constraints and ordering wrt literals of the standard
>> data types are handled smoothly; otherwise OWLIM could not score well
>> on a benchmark such as BSBM
>
> We would be happy with interval restrictions on number and string data.

At present we do not perform number normalisation or special-purpose
indexing. Thus, comparing number literals is done based on their string
representation. If one needs to do accurate interval searches and other
constraints on numbers, one should take care to normalise the string
representation of the numbers in the application. Still, performance-wise,
such constraints on literals are fast enough for the usage scenarios we
have seen so far.
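To make this concrete, here is a minimal sketch of such application-side
normalisation (the property name and values below are made up for
illustration). If the application writes prices as fixed-width, zero-padded
plain literals (same number of digits before and after the decimal point),
lexicographic order coincides with numeric order, so a plain SPARQL
interval restriction behaves as expected:

    PREFIX ex: <http://example.org/>
    SELECT ?product ?price
    WHERE {
      ?product ex:price ?price .
      # string comparison, but correct thanks to the fixed-width encoding
      FILTER (?price >= "0099.50" && ?price <= "0199.99")
    }

Without the padding, string comparison would order "9.50" after "100.00";
once all values are padded to the same width ("0009.50", "0100.00"),
string order and numeric order agree.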
>> BigOWLIM does offer geo-spatial indexing and queries, as described in
>> [5]. There are also several modalities of integrated FTS, [6]
>
> It is good to see that various RDF store vendors are working on geo
> support. In most applications, we could also live with bounding-box-type
> matching as long as numeric ranges can be queried. So this is not a hard
> requirement.

Well, this depends on the size of the dataset you deal with. Trying to get
from Geonames the airports within 50 miles of London without a geo-spatial
index can be 500 times slower than when such an index is used, as presented
here: http://ontotext.com/owlim/geo.html

Cheers,
Naso

----------------------------------------------------------
Atanas Kiryakov
Executive Director
Ontotext AD, http://www.ontotext.com
Phone: (+359 2) 974 61 44; Fax: 975 3226
----------------------------------------------------------
Received on Friday, 25 February 2011 17:41:10 UTC