Re: RFC: Berlin SPARQL Benchmark

Chris Bizer wrote:
> SPARQL query language and the SPARQL protocol are implemented by a 
> growing number of storage systems and are used within enterprise and 
> open web settings. As SPARQL is taken up by the community there is a 
> growing need for benchmarks to compare the performance of storage 
> systems that expose SPARQL endpoints via the SPARQL protocol.
> 
> We have been working over the last week on such a benchmark called the 
> Berlin SPARQL Benchmark (BSBM). 

I ran the benchmark against my SemWeb .NET library [1], whose SPARQL 
engine is a fork of Ryan Levering's GSoC project from a few years 
back. Instructions for setting up the benchmark are at [2]. They also 
turned out to be a good example of how to set up a SPARQL endpoint 
using the library, backed by your SQL database of choice (in this case 
MySQL).

For full disclosure, I had to correct a few bugs in the library before 
all of the queries in the benchmark ran through OK. These are listed at [2].

I also have some concerns. First, I am not 100% sure that the results 
my library returns are actually correct; Query 4, for instance, always 
seemed to return no results. Second, queries are largely translated 
into SQL, and there is a good deal of caching going on at the level of 
MySQL. The benchmark results therefore say a lot about best-case run 
time and indicate something about the overhead of SPARQL processing, 
but may not reflect general-use performance.

Benchmark results reported below are for my desktop: Intel Core2 Duo at 
3.00GHz, 2 GB RAM, 32bit Ubuntu 8.04 on Linux 2.6.24-19-generic, Java 
1.6.0_06 for the benchmark tools, and Mono 1.9.1. This seems roughly 
comparable to the machine used in the BSBM.

Load time (in seconds) and load rate (in triples/sec) are reported 
below for two of the data set sizes.

                 1M      25M
Time (sec)      224    16129
triples/sec    4441     1544

For comparison, my load time for the 1M data set (224 seconds) is 
roughly 1.9 times that of Jena SDB (Hash) with MySQL over Joseki3 
(117s) and roughly 2.6 times that of Virtuoso Open-Source Edition 
v5.0.6 and v5.0.7 (87s), as reported in the BSBM results. For the 
larger 25M data set, the load time of about 4.5 hours was only 1.2 
times slower than Jena SDB but 1.7 times faster than Sesame over 
Tomcat. (But, again, the machines were different.)
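
The 1M ratios and the hours figure can be rechecked with a few lines of 
Python. This is only a sketch: the numbers are the ones quoted in this 
message (my own load times plus the 117s and 87s figures from the BSBM 
results), and the variable and function names are mine, not part of any 
benchmark tool.

```python
# Load times in seconds, as quoted in this message.
semweb_1m = 224
semweb_25m = 16129
jena_sdb_1m = 117   # Jena SDB (Hash) with MySQL over Joseki3, per the BSBM results
virtuoso_1m = 87    # Virtuoso Open-Source Edition, per the BSBM results


def slowdown(mine, other):
    """Return how many times longer `mine` took than `other`."""
    return mine / other


ratio_jena = slowdown(semweb_1m, jena_sdb_1m)
ratio_virtuoso = slowdown(semweb_1m, virtuoso_1m)
hours_25m = semweb_25m / 3600  # seconds to hours

print(f"1M load vs Jena SDB: {ratio_jena:.2f}x")
print(f"1M load vs Virtuoso: {ratio_virtuoso:.2f}x")
print(f"25M load: {hours_25m:.2f} hours")
```

The 25M comparisons are left out because the Jena SDB and Sesame load 
times for that data set are not quoted here.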

Results for query execution follow. AQET (Average Query Execution 
Time, in seconds) is reported for each query at two data set sizes. 
The results were again roughly comparable to Jena and Virtuoso. But 
the three caveats above are worth restating: the query results have 
not been validated as correct, there is significant caching, and the 
machine was different from the one used in BSBM.

              1M         25M
Query 1    0.019184   0.049200
Query 2    0.051187   0.048590
Query 3    0.030508   0.079187
Query 4    0.032693   0.075603
Query 5    0.172283   0.342828
Query 6    0.102105   3.277656
Query 7    0.256491   1.108414
Query 8    0.175357   0.572258
Query 9    0.059674   0.088451
Query 10   0.089215   0.322246
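
To see how each query scales between the two data set sizes, the AQET 
figures can be turned into growth factors. The sketch below just divides 
the 25M time by the 1M time for each query; the dictionary layout and 
names are my own, and given the caching caveat the factors should be 
read as rough indications only.

```python
# AQET in seconds, copied from the table above: query number -> (1M, 25M).
aqet = {
    1: (0.019184, 0.049200),
    2: (0.051187, 0.048590),
    3: (0.030508, 0.079187),
    4: (0.032693, 0.075603),
    5: (0.172283, 0.342828),
    6: (0.102105, 3.277656),
    7: (0.256491, 1.108414),
    8: (0.175357, 0.572258),
    9: (0.059674, 0.088451),
    10: (0.089215, 0.322246),
}


def growth(times):
    """Return the 25M time divided by the 1M time."""
    small, large = times
    return large / small


factors = {query: growth(times) for query, times in aqet.items()}
worst = max(factors, key=factors.get)  # query with the largest growth factor

for query in sorted(factors):
    print(f"Query {query:2d}: {factors[query]:.1f}x")
print(f"Largest growth: Query {worst}")
```

On these numbers Query 6 grows the most, by about 32 times, while 
Query 2 is essentially flat.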

[1] http://razor.occams.info/code/semweb
[2] http://razor.occams.info/code/semweb/semweb-current/doc/bsbm.html

-- 
- Josh Tauberer

http://razor.occams.info

"Yields falsehood when preceded by its quotation!  Yields
falsehood when preceded by its quotation!" Achilles to
Tortoise (in "Godel, Escher, Bach" by Douglas Hofstadter)

Received on Sunday, 10 August 2008 12:59:26 UTC