ANN: New Berlin SPARQL Benchmark results for datasets ranging from 10 million to 150 billion RDF triples

Hi all,

The Berlin SPARQL Benchmark (BSBM) is a benchmark for measuring the 
performance of storage systems that expose SPARQL endpoints. The 
benchmark is built around an e-commerce use case in which a set of 
products is offered by different vendors. The benchmark defines two query 
mixes:
1. The query mix of the Explore use case 
<http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/spec/ExploreUseCase/index.html> illustrates 
the search and navigation pattern of a consumer looking for a product 
via some web portal.
2. The query mix of the Business Intelligence use case 
<http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/spec/BusinessIntelligenceUseCase/index.html> simulates 
different stakeholders asking analytical questions against the dataset. 
The query mix relies heavily on SPARQL 1.1 constructs like GROUP BY and 
COUNT() and is designed to touch large portions of the benchmark dataset.
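
To make the kind of aggregation behind GROUP BY and COUNT() concrete, here 
is a minimal sketch in plain Python. It is not an actual BSBM query: the 
toy triples, the "vendor" predicate and the function name are all made up 
for illustration, and a real store would of course evaluate this inside 
its SPARQL engine.

```python
from collections import Counter

# Hypothetical toy data as (subject, predicate, object) triples.
# These identifiers are illustrative only and are not BSBM vocabulary.
triples = [
    ("offer1", "vendor", "vendorA"),
    ("offer2", "vendor", "vendorA"),
    ("offer3", "vendor", "vendorB"),
    ("offer1", "product", "product1"),
]


def count_offers_per_vendor(data):
    """Count offers per vendor.

    Roughly what a query of this shape would compute (hypothetical prefix):
      SELECT ?vendor (COUNT(?offer) AS ?n)
      WHERE { ?offer :vendor ?vendor }
      GROUP BY ?vendor
    """
    # Keep only the "vendor" triples and group them by their object.
    return Counter(obj for _subj, pred, obj in data if pred == "vendor")


counts = count_offers_per_vendor(triples)
```

The point of the sketch is only that such a query has to visit every 
matching triple before it can return a single row, which is why this query 
mix touches large portions of the dataset.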

I'm happy to announce the results of a new BSBM benchmark experiment.  
The experiment compares the performance of

1. BigData
2. BigOwlim
3. Jena TDB
4. Virtuoso

on a single machine using datasets ranging from 10 million to 1 billion 
RDF triples (Explore and Business Intelligence query mixes).

In addition, it compares the performance of

1. BigOwlim
2. Virtuoso

on a cluster of 8 machines using datasets ranging from 10 billion to 150 
billion RDF triples (Explore and Business Intelligence query mixes).

The results of the experiment can be found at

http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/results/V7/

I think that the results are quite impressive and demonstrate that 
SPARQL stores have matured considerably over the last few years.

A year ago, many RDF stores still had problems with the SPARQL 1.1 
constructs GROUP BY and COUNT() and were thus not able to execute the 
Business Intelligence query mix. Now, all systems pass this test and 
some of the systems show impressive performance on grouping and 
aggregating the data.

The 150 billion triple experiment has shown that, given proper hardware, 
it is possible to run analytical queries on amounts of data that are 
beyond most (all?) of today's use cases: the whole LOD Cloud [1] is 
estimated to consist of only 31 billion triples; the RDFa, Microdata and 
Microformat dataset extracted by the WebDataCommons [2] project from 3 
billion HTML pages consists of only 7.3 billion triples. So, 150 billion 
triples leaves quite some room for the further growth of structured data 
on the Web ;-)
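
As a rough back-of-the-envelope check of that headroom, the sketch below 
compares the three figures quoted above. Adding the LOD Cloud and 
WebDataCommons numbers together is my simplifying assumption (it ignores 
any overlap between the two); the variable names are illustrative.

```python
# Triple counts as quoted in this mail.
BENCHMARK_TRIPLES = 150_000_000_000  # largest BSBM dataset in the experiment
LOD_CLOUD_TRIPLES = 31_000_000_000   # estimated size of the LOD Cloud [1]
WDC_TRIPLES = 7_300_000_000          # WebDataCommons extraction [2]

# Assumption: treat the two datasets as disjoint and simply add them.
combined = LOD_CLOUD_TRIPLES + WDC_TRIPLES

# How many times each quantity fits into the benchmark dataset.
ratio_vs_lod = BENCHMARK_TRIPLES / LOD_CLOUD_TRIPLES
ratio_vs_combined = BENCHMARK_TRIPLES / combined

print(f"vs. LOD Cloud alone: {ratio_vs_lod:.1f}x")
print(f"vs. LOD Cloud + WDC: {ratio_vs_combined:.1f}x")
```

Under that assumption the benchmark dataset is a bit under five times the 
LOD Cloud estimate and a bit under four times both datasets combined.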

More information about the Berlin SPARQL Benchmark, the exact 
specification of the benchmark query mixes, and results from 
previous benchmarking experiments can be found at

http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/

Many thanks to Peter Boncz and Minh-Duc Pham, who conducted the new 
experiment as part of the EU project LOD2 and provided their 
results for publication on the BSBM website.

Cheers,

Chris

[1] http://lod-cloud.net/state/
[2] http://www.webdatacommons.org/

Received on Monday, 29 April 2013 11:54:59 UTC