- From: Orri Erling <erling@xs4all.nl>
- Date: Fri, 19 Sep 2008 23:12:38 +0200
- To: "'Daniel Schwabe'" <dschwabe@inf.puc-rio.br>, "'Chris Bizer'" <chris@bizer.de>
- Cc: <semantic-web@w3.org>, <public-lod@w3.org>
> ...
> It is interesting to see:
> ...
> 4. that the fastest RDF store is still 7 times slower than a
> relational database.
>
> Has there been any analysis on whether there is a *fundamental* reason
> for such a performance difference? Or is it simply a question of
> "maturity" - in other words, relational db technology has been around
> for a very long time and is very mature, whereas RDF implementations
> are still quite recent, so this gap will surely narrow ...?
>
> Cheers
> D

Dan, All,

This is a very complex subject. I will offer some analysis below, but I fear this will only raise further questions. This is not the end of the road, far from it.

In the BSBM case, we first note that the relational representation, in this case with Virtuoso, is about 4x more space efficient than the triples representation. This translates to running in memory in cases where the triples representation would go to disk. In the time it takes to get one page from disk, say 5 ms, one can do a thousand random accesses in memory.

With Virtuoso 6, the ratio is somewhat more advantageous to triples. Still, a relational row store, such as MySQL or Virtuoso or pretty much any other RDBMS, except for the recent crop of column stores, of which we speak later, can store a non-indexed dependent column in the space it takes to store the data. Not everything is a triple, and not everything gets indexed multiple ways, from 2 to 6 indices, as with triples.

Let us further note that the BSBM report does not really define steady state. This is our (OpenLink) principal criticism of the process. The TPC (Transaction Processing Performance Council) benchmarks make it a point to eliminate nondeterminism coming from the OS disk cache. For OLTP, we run for half an hour first to see that the cache is filled, and then measure for another half hour. For analytics, the benchmark set is much larger than memory and the run starts with a full read-through of the biggest tables, which eliminates any randomness from OS disk caching. There may also be rules about switching off the power, but I would have to check the papers to be sure of this.

Now, the BSBM data sets are primarily in memory. However, if we start with a cold cache at 100M scale with Virtuoso relational, the first run is 20 times slower than the second run with the identical queries. If we shut down the server and redo the identical run, the performance is about the same as the faster run, because the OS still caches the disk pages. So, especially at larger scales, the BSBM test process simply must ensure steady state for whatever rates are reported. The easiest way to do this is to have a warm-up of scale factor / 10 query mixes rather than a constant 32 query mixes, as with the reported results.

Virtuoso SPARQL-to-SQL mapping performs slower than Virtuoso SQL primarily because of the increased complexity of query compilation. However, the published numbers may underestimate this difference because of not running with a cache in steady state. In other words, there is disk latency which penalizes both equally, while this disk latency would have vanished with another 5 to 10 minutes of running.

But let us talk about triples vs. rows. The BSBM workload typically retrieves multiple dependent attributes of a single key. If these attributes are all next to each other, as in a relational row store, then we have constant time per extra attribute instead of time logarithmic in the database size. This favors RDBMSs.
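To make this concrete, here is a minimal SQL sketch of the difference; the table, column, and predicate-code names are invented for illustration and are not the actual BSBM or Virtuoso schema:

    -- Row store: all dependent attributes of a product sit in one row, so a
    -- single primary-key lookup returns them all at constant cost.
    CREATE TABLE product (
        id           INTEGER PRIMARY KEY,
        label        VARCHAR(100),
        comment      VARCHAR(255),
        producer     INTEGER,
        propertynum1 INTEGER
    );

    SELECT label, comment, producer, propertynum1
    FROM product
    WHERE id = 1234;   -- one index descent fetches every column

    -- Generic triple table: every attribute is its own row, so the same
    -- retrieval becomes one indexed self-join per attribute, and the table
    -- is typically indexed 2 to 6 different ways on top of that.
    CREATE TABLE triple (
        s INTEGER NOT NULL,   -- subject
        p INTEGER NOT NULL,   -- predicate
        o VARCHAR(255),       -- object, simplified to one short type here
        PRIMARY KEY (s, p, o)
    );
    CREATE INDEX triple_pos ON triple (p, o, s);  -- one of several extra indices

    SELECT t1.o AS label, t2.o AS comment, t3.o AS producer, t4.o AS num1
    FROM triple t1, triple t2, triple t3, triple t4
    WHERE t1.s = 1234 AND t1.p = 100   -- p codes stand for rdfs:label etc.
      AND t2.s = 1234 AND t2.p = 101
      AND t3.s = 1234 AND t3.p = 102
      AND t4.s = 1234 AND t4.p = 103;  -- four log-sized lookups instead of one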
As mentioned before, there are also columnar RDBMSs, especially for analytics workloads. These do not have related attributes next to each other, but they can play tricks relying on dense spacing of row numbers, locality of reference, compression of homogeneous data, and sparse indices, which are not as readily applicable to a more unpredictable RDF workload. This is complex and we do not have space to go deeper into it here. We have considered these, as we naturally have contemplated making a column store adaptation of Virtuoso. We may yet make one.

Then there is the element of dissimilar semantics between SPARQL and SQL. Q6, which is basically a SQL LIKE full table scan, is especially unfair to triple stores. Even in the SQL world, this would be done using a text index in any halfway serious RDBMS, since they all have a full-text predicate. This hurts RDF seriously but inconveniences SQL somewhat less, because of locality of reference and the fact that LIKE has a simpler semantic than regexp (a sketch further below illustrates the difference). Further, as the scale factor goes to infinity, the ratio of Q6 to all queries goes to unity. In other words, the composition of the metric is not scale independent: if the scale is large, Q6 is the only thing that matters, and it is furthermore a thing that really shows triples at their worst.

This is recognized by the authors and addressed by dropping Q6 out of the metric in some cases, but this is still not entirely satisfactory, since it confuses the metric. The TPC experience suggests that a single benchmark must produce a single metric, plus a separate full disclosure report. The best would be to use a text index with systems that support one and put them in a separate table, and make another table with all systems but with a metric with no Q6. The metrics could be labeled BSBM basic and BSBM extended or something of the sort, since there already are results with and without Q6. Dropping a text lookup altogether is not per se good, since this is such a universal operation. Allow using a text index when there is one, and if there is not, use another metric. Do not put full table scans into SPARQL benchmarks.

A lesser item in the same direction is the use of DESCRIBE, which is not commensurate between SPARQL and SQL, nor even between SPARQL implementations. Still, this is not as serious as Q6, because it does not affect the composition of the metric as scale changes.

Also, query compilation plays a role in BSBM, especially with RDB-to-RDF mapping. Using parametrized queries would be realistic. For example, even with SPARQL going to triples, we see a 30% cut in execution times if we use parametrized queries. Note that parametrized queries are not standard SPARQL. The BSBM workload is, however, the prototypical case for using parametrized queries, certainly in the SQL world.
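To illustrate the parametrized-query point in SQL terms (there being, as noted, no standard SPARQL equivalent), a minimal sketch reusing the invented product table from the earlier sketch; the PREPARE/EXECUTE syntax shown is PostgreSQL's and varies by system:

    -- Literal form: the query text differs for every product, so the server
    -- parses and compiles a fresh plan for each execution.
    SELECT label FROM product WHERE id = 1234;
    SELECT label FROM product WHERE id = 5678;

    -- Parametrized form: compiled once, then executed many times with
    -- different bindings; only the parameter value changes per execution.
    PREPARE get_label (INTEGER) AS
        SELECT label FROM product WHERE id = $1;
    EXECUTE get_label (1234);
    EXECUTE get_label (5678);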
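And to make the earlier Q6 point concrete: in SQL terms, the gap is between a LIKE scan and a text-index lookup. Another sketch against the same invented product table; the full-text syntax shown here is MySQL's and differs per system (e.g. a CONTAINS predicate elsewhere, or bif:contains on Virtuoso's SPARQL side):

    -- Q6 in SQL terms: a LIKE over every row, i.e. a full table scan whose
    -- cost grows linearly with the scale factor.
    SELECT id, label
    FROM product
    WHERE label LIKE '%word%';

    -- What any halfway serious RDBMS would do instead: a full-text predicate
    -- backed by a text index, turning the scan into an index lookup.
    CREATE FULLTEXT INDEX product_label_ft ON product (label);

    SELECT id, label
    FROM product
    WHERE MATCH (label) AGAINST ('word');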
Making and interpreting benchmarks is really complex. At large scales, it is a fair rule of thumb that whatever accesses less data will win. This is the point of column stores, and it can also work in favor of triples, since triples can be indexed in a columnar fashion. At scales where most of the data fits in memory, which is the case with the BSBM figures published, the row store will win because of a single lookup for multiple values. The greater the part of the row accessed by the typical query, the better the row store will do. BSBM is row friendly.

Taking a small-scale BSBM run, could we make triples outrun relational? I would say no. Compiling SQL has fewer choices to make, a shorter execution plan, fewer joins. At a larger scale, with unpredictable queries, triples could outrun relational because of uniform indexing and a more adaptive working set (i.e. columnar access).

BSBM is clearly a pro-relational use case. This is part of its point, since it allows comparing relational mapping and triples, a comparison which is certainly valuable. But to get to its full value, this comparison should have aspects of integration. This is a step on the way. BSBM is not the last word, but it certainly is needed.

Orri
Received on Friday, 19 September 2008 21:15:32 UTC