Re: BSBM With Triples and Mapped Relational Data in Virtuoso

From: Kingsley Idehen <kidehen@openlinksw.com>
Date: Thu, 07 Aug 2008 08:10:04 -0400
Message-ID: <489AE61C.9000002@openlinksw.com>
To: Chris Bizer <chris@bizer.de>
CC: public-lod@w3.org, "Orri Erling (by way of Ted Thibodeau Jr)" <erling@xs4all.nl>

Chris Bizer wrote:
> Hi Orri and Ivan,
>> Consequently, we need to show that mapping can outperform an RDF
>> warehouse, which is what we'll do here.
> Yes. I was already guessing for a while that SPARQL against RDF-mapped 
> relational DBs should be faster than SPARQL against triple stores. 
> With D2R Server it turned out that some queries are much faster, but 
> also that D2R Server really performs badly on others (especially Q5). 
> The bad performance with some queries was no surprise as there is 
> still lots of room for improvement in D2R Server's SPARQL-to-SQL query 
> rewriting algorithm.
> Another observation was that the distance between native RDF stores 
> and RDF-mapped RDBs increases with dataset size.
> So it looks like, if you have more than 50M triples and schemata 
> that somehow fit into an RDB, you should go for the RDF solution.
>> We also see that the advantage of mapping can be further increased
>> by more compiler optimizations, so we expect in the end mapping will
>> lead RDF warehousing by a factor of 4 or so.
> Being able to show a factor 4 on all dataset sizes would be very 
> interesting!
>> Suggestions for BSBM
>> * Reporting Rules. The benchmark spec should specify a form for
>>  disclosure of test run data, TPC style. This includes things like
>>  configuration parameters and exact text of queries. There should
>>  be accepted variants of query text, as with the TPC.
> We have started collecting stuff that should go into the 
> full-disclosure report in section 6.2 of the benchmark spec 
> http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html#reporting 
> but have not had the time to define a proper format for this yet (I 
> guess we will have some XML format). We will define the format for 
> version 2 of the benchmark, which will be released together with 
> updated results in about 3-4 weeks.
> If you think that there is something missing from this list, please 
> let us know.
>> * Multiuser operation. The test driver should get a stream number as
>>  parameter, so that each client makes a different query sequence.
>>  Also, disk performance in this type of benchmark can only be
>>  reasonably assessed with a naturally parallel multiuser workload.
> Yes. This is already on our todo list and will also be part of the 
> next release.
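
A minimal sketch of the stream-number idea, assuming a hypothetical
driver helper named query_sequence and placeholder query names (the
real BSBM driver and query mix may look quite different). Each client
seeds its own random generator from its stream number, so a given
stream always replays the same sequence while different streams get
different orderings:

```python
import random

# Placeholder query identifiers; the actual mix is defined by the benchmark spec.
QUERY_MIX = ["Q1", "Q2", "Q3", "Q4", "Q5", "Q6"]


def query_sequence(stream_number, mixes=1, base_seed=2008):
    """Return a deterministic, stream-specific ordering of the query mix.

    stream_number, mixes and base_seed are illustrative parameters,
    not options of any existing test driver.
    """
    # Derive a per-stream seed so reruns of the same stream are repeatable.
    rng = random.Random(base_seed * 1_000_003 + stream_number)
    sequence = []
    for _ in range(mixes):
        mix = list(QUERY_MIX)
        rng.shuffle(mix)  # reorder this pass through the mix
        sequence.extend(mix)
    return sequence


if __name__ == "__main__":
    for stream in range(3):
        print(stream, query_sequence(stream))
```

Running several such streams concurrently is what would give the
naturally parallel workload mentioned above.
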
>> * Add business intelligence. SPARQL has aggregates now, at least
>>  with Jena and Virtuoso, so let's use these. The BSBM business
>>  intelligence metric should be a separate metric off the same data.
>>  Adding synthetic sales figures would make more interesting queries
>>  possible. For example, producing recommendations like "customers
>>  who bought this also bought xxx."
> Hmm, yes and no. I would love to extend the benchmark with a BI query 
> mix, but aggregates are not yet an official part of SPARQL. Our goal 
> with the benchmark was to define a tool to compare stores that 
> implement the current SPARQL specs but not to fix these specs. Thus, 
> we stayed in the boundaries of the current spec and of course ran into 
> all the known problems of SPARQL (no aggregates, no free-text search, 
> no proper negation). All these things were discussed at the SPARQL 2 
> BOF at WWW2008 and I hope that they are all on Ivan Herman's list for 
> the charter of a new SPARQL WG.

I don't think we have to wait for SPARQL to catch up with SQL if we want 
to have enterprises take RDBMS to RDF Views seriously. 

We've already done the TPC-H Benchmark as the QA and Testbed for SPARQL 
Aggregates atop Native and Virtualized SQL Data Sources. Our results are 
very exciting and they prove that existing RDBMS data shouldn't be moved 
wholesale to RDF in order to experience and exploit the virtues of 
Linked Data.

Since we've already implemented TPC-H, I don't see why we can't 
contribute an enhancement to this benchmark. At the very least Virtuoso 
and SDB (for Jena) exist as initial exemplars.

Let's innovate at every turn.  I am sure you've noticed that OpenLink 
and HP continue to accelerate SPARQL improvements with enterprise 
viability in mind.  The recent delivery of SPARUL is another example.
>> * For the SPARQL community, BSBM sends the message that one ought to
>>  support parameterized queries and stored procedures. This would be
>>  a SPARQL protocol extension; the SPARUL syntax should also have a
>>  way of calling a procedure. Something like select proc (??, ??)
>>  would be enough, where ?? is a parameter marker, like ? in
> Also a great idea and maybe something Ivan does not have on his list yet.
This was on his todo in 2005 :-) Remember, Virtuoso is a SQL ORDBMS 
engine, so we are reconciling from SQL to SPARQL rather than SPARQL to 
SQL re. requisite functionality.
>> * Add transactions. Especially if we are contrasting mapping vs.
>>  storing triples, having an update flow is relevant. In practice,
>>  this could be done by having the test driver send web service
>>  requests for order entry and the SUT could implement these as
>>  updates to the triples or a mapped relational store. This could
>>  use stored procedures or logic in an app server.
> In principle yes, but we also wanted to design a benchmark that some 
> current RDF stores are able to run.
> If I look at the current data load times of the SUTs, I'm not so sure 
> that they like update streams ;-)
As per comment re. Aggregates, these are little tweaks that should exist 
as options.  Innovation by consensus doesn't really work as you know too 
well. Imagine if you had to build this really nice benchmark on a 
consensus basis?
> But I agree that update streams are clearly something that we should 
> have in the future.
Yes, the more realistic the benchmark the better for everyone (vendors 
and users).
>> Comments on Query Mix
>> The time of most queries is less than linear to the scale factor. Q6
>> is an exception if it is not implemented using a text index. Without
>> the text index, Q6 will inevitably come to dominate query time as the
>> scale is increased, and thus will make the benchmark less relevant at
>> larger scales.
> You are right and it is again a problem of us trying to stay in the 
> boundaries of the SPARQL spec.
> No sane person would use a regex for this kind of free-text search, 
> but SPARQL only offers the regex function and nothing else.
> Maybe we should be a bit less strict here and allow proprietary 
> variants of Q6 until SPARQL gets fixed.

Let's keep it real (as per sentiment expressed in my responses above).
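
To make the Q6 point concrete, here is a rough sketch in plain Python
(not SPARQL, and not how any particular store implements it). The
names regex_scan, build_text_index and index_lookup and the sample
labels are made up for illustration. The regex variant has to touch
every label, so its cost grows with the dataset; the index variant
does one dictionary lookup per search word:

```python
import re
from collections import defaultdict


def regex_scan(labels, word):
    """Regex-style search: test every label, cost linear in dataset size."""
    pattern = re.compile(re.escape(word), re.IGNORECASE)
    return {pid for pid, label in labels.items() if pattern.search(label)}


def build_text_index(labels):
    """Build a simple inverted index: lowercase token -> set of product ids."""
    index = defaultdict(set)
    for pid, label in labels.items():
        for token in re.findall(r"\w+", label.lower()):
            index[token].add(pid)
    return index


def index_lookup(index, word):
    """Text-index search: a single lookup, independent of dataset size."""
    return set(index.get(word.lower(), set()))


if __name__ == "__main__":
    # Made-up sample data, keyed by a hypothetical product id.
    labels = {1: "red widget", 2: "blue gadget", 3: "Widget deluxe"}
    index = build_text_index(labels)
    print(regex_scan(labels, "widget"))
    print(index_lookup(index, "widget"))
```

Note that the two are not strictly equivalent: the regex matches
substrings while this toy index matches whole words only, which is
one reason accepted query variants would need to be spelled out.
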

>> Next
>> We include the sources of our RDF view definitions and other material
>> for running BSBM with our forthcoming Virtuoso Open Source 5.0.8
>> release. This also includes all the query optimization work done for
>> BSBM. This will be available in the coming days.
> Great. We are looking forward to rerunning the benchmark with the new 
> Virtuoso release on our box. Especially being able to confirm the 
> factor 4 advantage of RDF-mapped RDBs against RDF stores would be fun ;-)
Big time! SQL Compilers and Optimizations have come a long way over 
the years. This is a sure way to awaken many of the SQL DBMS behemoths 
out there :-)

Chris & Andreas:

Again, this is a major contribution to the broad realm of Enterprise 
Linked Data, really good job!  

> Cheers
> Chris and Andreas
>> - Orri
>> [1]
>> <http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1409> 
>> [2]
>> <http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html> 
>> [3] <http://www.w3.org/2005/Incubator/rdb2rdf/>
>> [4]
>> <http://www.openlinksw.com/dataspace/oerling/weblog/Orri%20Erling%27s%20Blog/1400> 



Kingsley Idehen	      Weblog: http://www.openlinksw.com/blog/~kidehen
President & CEO 
OpenLink Software     Web: http://www.openlinksw.com
Received on Thursday, 7 August 2008 12:10:44 UTC
