Re: Looking for the right RDF store(s) from Markus Krötzsch on 2011-02-25 (semantic-web@w3.org from February 2011)

From: Markus Krötzsch <markus.kroetzsch@comlab.ox.ac.uk>
Date: Fri, 25 Feb 2011 20:40:49 +0000
To: "Wood, Jamey" <Jamey.Wood@nrel.gov>
CC: "semantic-web@w3.org" <semantic-web@w3.org>
Message-ID: <4D6813D1.50307@comlab.ox.ac.uk>

On 25/02/2011 17:42, Wood, Jamey wrote:
> I'll put in a vote for Virtuoso [1].  Its open source edition is truly
> free software (GPL), and it's a top performer in most publicly-released
> benchmarks that I've seen [2].
>
> Since you mention wanting to support applications that are on the order of
> DBpedia, it seems only natural to give strong consideration to Virtuoso
> (as the triplestore that powers DBpedia).  I can also say that in our
> project (OpenEI.org [3]), Virtuoso has proven to be a nice complement to
> Semantic MediaWiki.

Good to hear that. I would certainly have considered Virtuoso, although
it would be more attractive if pre-compiled packages were available for
the open-source edition.

> We currently accomplish this by importing a nightly
> RDF dump of the SMW data into Virtuoso (which then powers SPARQL queries
> and LOD views of that data).  We'd love to switch to more of a real-time
> flow of data from SMW into our triplestore, though.  So we're very
> interested in your work.

Interesting. Only a few of our users take direct advantage of RDF (even
now, my main motive for incorporating an RDF store is the prospective
performance gain for internal features). So I would like to learn more
about your scenario. This may also have an impact on future changes in
the RDF export. I recently changed some aspects of the encoding, which
hopefully does not affect your applications. Feel free to contact me
off-list on this.

>
> In my understanding, a couple of the soft requirements you note (like
> clustering and geographic queries) would really be handled by the
> commercial edition of Virtuoso.  But at least that exists as an option for
> those who need it.

Agreed.

> Hopefully someone from the Virtuoso team can jump in
> with more of a point-by-point response to your requirements.

Yes, I would appreciate this.

> In general,
> I'm just piping up to say that our project is making extensive use of SMW
> and Virtuoso together as a solution already, and we'd love to see that
> pairing become stronger.

This is our intention. Some forms of live synchronisation of triple
stores are already possible with extensions today. What we would like to
achieve is that RDF stores become the primary backend for internal
functions of our software, changing much of our internal storage
access from SQL to SPARQL. There is still some way to go on the
implementation side, but my little survey here makes me confident that
the available free systems are more than capable of addressing our needs.
Some years back, when we did experiments with Sesame and Jena, we found
that MySQL often delivered better or at least more robust query
performance, but now might be the time for us to switch.
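
As a rough illustration of what live synchronisation over SPARQL Update
could look like (a sketch only, not SMW's actual code): when a page is
saved, its old triples are dropped and the new ones inserted in a single
request. The syntax follows the SPARQL 1.1 Update and Protocol drafts;
the graph IRI, property IRIs, endpoint URL and function names below are
all hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request


def literal(value):
    """Serialise an int or a string as a SPARQL literal (sketch: no other types)."""
    if isinstance(value, int) and not isinstance(value, bool):
        return str(value)
    escaped = (str(value).replace("\\", "\\\\").replace('"', '\\"')
               .replace("\n", "\\n").replace("\r", "\\r"))
    return '"' + escaped + '"'


def build_page_update(graph, subject, properties):
    """Build one update that replaces all triples of `subject` in `graph`."""
    lines = [f"<{subject}> <{p}> {literal(o)} ." for p, o in properties]
    body = "\n    ".join(lines)
    return (
        f"DELETE WHERE {{ GRAPH <{graph}> {{ <{subject}> ?p ?o }} }} ;\n"
        f"INSERT DATA {{ GRAPH <{graph}> {{\n    {body}\n}} }}"
    )


def build_request(endpoint, update):
    """Wrap the update in an HTTP POST (form-encoded `update` parameter)."""
    data = urlencode({"update": update}).encode("utf-8")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(endpoint, data=data, headers=headers, method="POST")


if __name__ == "__main__":
    update = build_page_update(
        "http://example.org/wiki",               # hypothetical graph IRI
        "http://example.org/wiki/Berlin",        # hypothetical page IRI
        [("http://example.org/prop/population", 3500000),
         ("http://example.org/prop/label", "Berlin")],
    )
    print(update)
    # The request is only built here; sending it (urllib.request.urlopen)
    # requires a running store at the hypothetical endpoint below.
    request = build_request("http://localhost/sparql-update", update)
```

Whether a given store accepts the form-encoded variant, requires
authentication, or expects a different endpoint path would have to be
checked per store.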

- Markus

>
> On 2/24/11 9:28 AM, "Markus Krötzsch" <markus.kroetzsch@comlab.ox.ac.uk>
> wrote:
>
>> Hi,
>>
>> Some installations of Semantic MediaWiki have become quite big, and our
>> users are looking for faster storage backends (than MySQL!) for query
>> answering. We will provide RDF store bindings via SPARQL Update, but we
>> are unsure which RDF stores to recommend to our users. Input is
>> appreciated (also let me know if I should rather take this to a more
>> specific list).
>>
>> SPARQL Update is mandatory. The following are softer requirements:
>>
>> (1) Free software version available (crucial to many of our users,
>> essential if we are to make it our default store),
>> (2) Robust & reliable for medium to large datasets,
>> (3) Good handling of many concurrent queries (of controlled complexity)
>> (4) Good handling of continuous updates (some update lag is acceptable,
>> but the store should not impose update scheduling on the user)
>> (5) Good support for datatype-related queries (numerical and
>> lexicographic sorting, ideally also distance queries for geographic
>> data, ideally some forms of string pattern matching)
>> (6) Options for scaling out; but without being obliged to start with a
>> multi-node server cluster.
>>
>> I am aware that the above requirements are not specific -- indeed the
>> details vary widely across our applications (from 10 users and millions
>> of pages to tens of thousands of users and pages). Single-user query
>> performance and top speed for highly complex queries are not of much
>> interest, but robustness and reliability are. We consider some
>> applications on the order of the English DBpedia, but this is not
>> typical. Cases with some 10 million triples should be covered, though.
>> Of course, well-equipped servers (RAID, SSD-based storage, loads of
>> RAM, etc.) can be assumed.
>>
>> What we are looking for are good candidates to recommend in general,
>> knowing that users will still need to pick the optimal solution for
>> their individual data sets. What we can offer to RDF store suppliers is
>> significant visibility in various user communities (e.g., our biggest
>> web user is Wikia, hosting about 30,000 wiki communities; SMW users in
>> industry would also appreciate more sophisticated storage solutions).
>>
>> Thanks,
>>
>> Markus


-- 
Dr. Markus Krötzsch
Oxford  University  Computing  Laboratory
Room 306, Parks Road, Oxford, OX1 3QD, UK
+44 (0)1865 283529    http://korrekt.org/
Received on Friday, 25 February 2011 20:44:29 UTC