BSBM With Triples and Mapped Relational Data in Virtuoso from Orri Erling on 2008-08-06 (public-xg-rdb2rdf@w3.org from August 2008)

From: Orri Erling <erling@xs4all.nl>
Date: Wed, 6 Aug 2008 16:47:12 -0400
To: public-xg-rdb2rdf@w3.org
Message-ID: <E1KQpx0-0002tu-Qz@maggie.w3.org>
(Article also posted to blog [1])

The special contribution of the Berlin SPARQL Benchmark (BSBM) [2]
to the RDF world is to raise the question of doing OLTP with RDF.

Of course, here we immediately hit the question of comparisons with
relational databases. To this effect, BSBM also specifies a relational
schema and can generate the data as either triples or SQL inserts.

The benchmark effectively simulates the case of exposing an existing
RDBMS as RDF. OpenLink Software calls this RDF Views. Oracle is
beginning to call this semantic covers. The RDB2RDF XG [3], a W3C
incubator group, has been active in this area since Spring, 2008.


But why an OLTP workload with RDF to begin with?

We believe this is relevant because RDF promises to be the
interoperability layer across potentially all traditional
information systems.
If data is online for human consumption, it may be online via a
SPARQL end-point as well. The economic justification will come
from discoverability and from applications integrating multi-source
structured data. Online shopping is a fine use case.

Warehousing all the world's publishable data as RDF is not our
first preference, nor would it be the publisher's. Considerations
of duplicate infrastructure and maintenance are reason enough.
Consequently, we need to show that mapping can outperform an RDF
warehouse, which is what we'll do here.


What We Got

First, we found that making the query plan took much too long [4]
in proportion to the run time. With BSBM this is an issue because
the queries have lots of joins but access relatively little data.
So we made a faster compiler and along the way retouched the cost
model a bit.

But the really interesting part with BSBM is mapping relational
data to RDF. For us, BSBM is a great way of showing that mapping
can outperform even the best triple store. A relational row store
is as good as unbeatable with the query mix. And when there is a
clear mapping, there is no reason the SPARQL could not be directly
translated.
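
As a minimal illustration of that idea (not Virtuoso's RDF Views
implementation), the Python sketch below rewrites single triple
patterns into SQL over a toy table. The product table, its columns,
and the ex: predicate names are hypothetical; they stand in for
whatever a real mapping would declare.

```python
import sqlite3

# Hypothetical mapping: predicate -> (table, key column, value column).
# These names are illustrative only and are not taken from the BSBM schema.
MAPPING = {
    "ex:label": ("product", "nr", "label"),
    "ex:producer": ("product", "nr", "producer"),
}


def pattern_to_sql(predicate):
    """Translate the pattern `?s <predicate> ?o` into a SQL SELECT."""
    table, key_col, value_col = MAPPING[predicate]
    return f"SELECT {key_col}, {value_col} FROM {table}"


def lookup_to_sql(predicate):
    """Translate `<subject> <predicate> ?o` into a keyed SQL lookup."""
    table, key_col, value_col = MAPPING[predicate]
    return f"SELECT {value_col} FROM {table} WHERE {key_col} = ?"


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product (nr INTEGER PRIMARY KEY, label TEXT, producer INTEGER)"
)
conn.executemany(
    "INSERT INTO product VALUES (?, ?, ?)",
    [(1, "red widget", 10), (2, "blue gadget", 11)],
)

# A pattern with a fixed subject (product 2) becomes a primary key lookup.
label_rows = conn.execute(lookup_to_sql("ex:label"), (2,)).fetchall()
# A pattern with a free subject becomes a scan of the mapped column.
all_producers = conn.execute(
    pattern_to_sql("ex:producer") + " ORDER BY nr"
).fetchall()
print(label_rows, all_producers)
```

The point of the sketch is that a fixed-subject pattern turns into
one indexed row access, which is the access path a relational row
store is built for.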

If Chris Bizer et al. launched the mapping ship, we will be the ones
to pilot it to harbor!

We filled two Virtuoso instances with a BSBM200000 data set, for
100M triples. One was filled with physical triples; the other was
filled with the equivalent relational data plus mapping to triples.
Performance figures are given in "query mixes per hour". (An update
or follow-on to this post will provide elapsed times for each test
run.)

With the unmodified benchmark we got:

   Physical Triples:      1297 qmph
   Mapped Triples:        3144 qmph

In both cases, most of the time was spent on Q6, which looks for
products with one of three words in the label. We altered Q6 to use
a text index for the mapping, and altered the databases accordingly.
(There is no such thing as an e-commerce site without a text index,
so we are amply justified in making this change.)

The following were measured on the second run of a 100 query mix
series, single test driver, warm cache.

   Physical Triples:      5746 qmph
   Mapped Triples:        7525 qmph

We then ran the same with 4 concurrent instances of the test driver.
The qmph here is the 400 query mixes (4 drivers x 100 mixes) divided
by the longest driver run time, in hours.

   Physical Triples:     19459 qmph
   Mapped Triples:       24531 qmph
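
The concurrent metric can be written out as a short Python sketch.
The function names and the run times in the example are hypothetical;
this post does not give elapsed times.

```python
def query_mixes_per_hour(total_mixes, elapsed_seconds):
    """Throughput in query mixes per hour for a run of `total_mixes` mixes."""
    return total_mixes * 3600.0 / elapsed_seconds


def concurrent_qmph(mixes_per_driver, driver_seconds):
    """Charge the mixes of all drivers against the slowest driver's run time."""
    total_mixes = mixes_per_driver * len(driver_seconds)
    return query_mixes_per_hour(total_mixes, max(driver_seconds))


# Hypothetical run times in seconds for 4 drivers of 100 mixes each.
example = concurrent_qmph(100, [70.0, 72.0, 71.5, 80.0])
print(example)
```

Dividing by the longest run time is the conservative choice: the
faster drivers get no credit for finishing early.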

The system used was 64-bit Linux, 2GHz dual-Xeon 5130 (8 cores) with
8 GB RAM. The concurrent throughputs are about 3.3 to 3.4 times the
single-driver throughputs, somewhat short of the ideal 4 times,
which is normal for SMP due to memory
contention. The numbers do not evidence significant overhead from
thread synchronization.
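
The scaling and the advantage of mapping can be checked directly
from the figures reported above. The following Python snippet only
restates that arithmetic; the variable names are ours.

```python
# Throughput figures (qmph) reported above for the text-index variant of Q6.
single = {"physical": 5746, "mapped": 7525}
concurrent = {"physical": 19459, "mapped": 24531}

# Speedup from 1 to 4 concurrent test drivers, per store.
for store in single:
    speedup = concurrent[store] / single[store]
    print(f"{store}: {speedup:.2f}x with 4 drivers")

# Advantage of mapped over physical triples in each configuration.
unmodified = 3144 / 1297
single_ratio = single["mapped"] / single["physical"]
concurrent_ratio = concurrent["mapped"] / concurrent["physical"]
print(f"{unmodified:.2f} {single_ratio:.2f} {concurrent_ratio:.2f}")
```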

Query compilation accounts for about 1/3 of total server-side
CPU. In an actual online application of this type, queries would
be parameterized, so the throughputs would be accordingly higher.
We used the StopCompilerWhenXOverRunTime = 1 option here to cut
needless compiler overhead, the queries being straightforward enough.
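
As a rough upper bound on what parameterization could buy, the
Python sketch below assumes the server is CPU-bound and that
prepared queries remove compilation entirely. Neither assumption
was measured here, so the result is a ceiling, not a prediction.

```python
def max_speedup(removed_fraction):
    """Upper bound on throughput gain if this fraction of CPU work disappears."""
    return 1.0 / (1.0 - removed_fraction)


# Compilation share of server-side CPU, as stated above.
compile_share = 1.0 / 3.0
bound = max_speedup(compile_share)
print(f"{bound:.2f}x")
```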


We also see that the advantage of mapping can be further increased
by more compiler optimizations, so we expect in the end mapping will
lead RDF warehousing by a factor of 4 or so.


Suggestions for BSBM

* Reporting Rules. The benchmark spec should specify a form for
  disclosure of test run data, TPC style. This includes things like
  configuration parameters and exact text of queries. There should
  be accepted variants of query text, as with the TPC.

* Multiuser operation. The test driver should get a stream number as
  parameter, so that each client makes a different query sequence.
  Also, disk performance in this type of benchmark can only be
  reasonably assessed with a naturally parallel multiuser workload.

* Add business intelligence. SPARQL has aggregates now, at least
  with Jena and Virtuoso, so let's use these. The BSBM business
  intelligence metric should be a separate metric off the same data.
  Adding synthetic sales figures would make more interesting queries
  possible. For example, producing recommendations like "customers
  who bought this also bought xxx."

* For the SPARQL community, BSBM sends the message that one ought to
  support parameterized queries and stored procedures. This would be
  a SPARQL protocol extension; the SPARUL syntax should also have a
  way of calling a procedure. Something like select proc (??, ??)
  would be enough, where ?? is a parameter marker, like ? in
  ODBC/JDBC.

* Add transactions. Especially if we are contrasting mapping vs.
  storing triples, having an update flow is relevant. In practice,
  this could be done by having the test driver send web service
  requests for order entry and the SUT could implement these as
  updates to the triples or a mapped relational store. This could
  use stored procedures or logic in an app server.
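
The ?? parameter markers suggested in the list above can be
illustrated with a small client-side Python sketch. The bind and
sparql_literal helpers are hypothetical and are not part of any
SPARQL protocol or Virtuoso API. A real protocol extension would
send the parameters separately, so that the server can reuse the
compiled plan; this sketch only shows the marker syntax.

```python
import re


def sparql_literal(value):
    """Render a Python value as a SPARQL literal (strings and numbers only)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    escaped = (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def bind(template, params):
    """Replace each ?? marker, left to right, with the next parameter.

    Limitation of this sketch: a ?? inside a quoted string in the
    template is also treated as a marker.
    """
    markers = template.count("??")
    if markers != len(params):
        raise ValueError(f"expected {markers} parameters, got {len(params)}")
    values = iter(params)
    return re.sub(r"\?\?", lambda _: sparql_literal(next(values)), template)


query = bind("select proc (??, ??)", [42, "abc"])
print(query)
```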


Comments on Query Mix

The time of most queries grows less than linearly with the scale factor. Q6
is an exception if it is not implemented using a text index. Without
the text index, Q6 will inevitably come to dominate query time as the
scale is increased, and thus will make the benchmark less relevant at
larger scales.


Next

We will include the sources of our RDF view definitions and other material
for running BSBM with our forthcoming Virtuoso Open Source 5.0.8
release. This also includes all the query optimization work done for
BSBM. This will be available in the coming days.


- Orri





[1]
<http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1409>

[2]
<http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html>

[3] <http://www.w3.org/2005/Incubator/rdb2rdf/>

[4]
<http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1400>
Received on Wednesday, 6 August 2008 20:49:52 UTC