Re: AW: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL from Kingsley Idehen on 2008-09-25 (semantic-web@w3.org from September 2008)

From: Kingsley Idehen <kidehen@openlinksw.com>
Date: Thu, 25 Sep 2008 12:18:41 -0400
To: Chris Bizer <chris@bizer.de>
CC: 'Paul Gearon' <gearon@ieee.org>, semantic-web@w3.org, public-lod@w3.org
Message-ID: <48DBB9E1.2010901@openlinksw.com>
Chris Bizer wrote:
> Hi Kingsley and Paul,
>
> Yes, I completely agree with you that different storage solutions fit
> different use cases and that one of the main strengths of the RDF data model
> is its flexibility and the possibility to mix different schemata.
>
> Nevertheless, I think it is useful to give application developers an
> indicator about what performance they can expect when they choose a specific
> architecture, which is what the benchmark is trying to do.
>
> We plan to run the benchmark again in January and it would be great to also
> test Tucana/Kowari/Mulgara in this run.
>
> As the performance of RDF stores is constantly improving, let's also hope
> that the picture will not look that bad for them anymore then.
>
> Cheers,
>
> Chris
>
>   
Chris,

Yes, but the user profile has to be a little clearer. If you separate 
the results in the narrative you achieve the goal. You can use SQL 
numbers as a sort of benchmark if you clearly explain the natural skew 
that SQL enjoys due to the nature of the schema.
> We plan to run the benchmark again in January and it would be great to also
> test Tucana/Kowari/Mulgara in this run.
>
> As the performance of RDF stores is constantly improving, let's also hope
> that the picture will not look that bad for them anymore then.
>   
But at the current time, there is no clear sense of what better means 
:-) What's the goal?

What I fundamentally take from the benchmarks is the following:

1. Native RDF and RDF Views/Mapper scalability is becoming less of an 
issue (of course depending on your choice of product) and we are already 
at the point where this technology can be used for real-world solutions 
that have enterprise-level scalability demands and expectations.

2. It's impractical to create RDF warehouses from existing SQL Data 
Sources when you can put RDF Views / Wrappers in front of the SQL Data 
Sources (SQL cost optimization technology has evolved significantly over 
the years across RDBMS engines).
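
A minimal sketch of the idea behind point 2, assuming a hypothetical
"product" table and a made-up http://example.org/ namespace. It is not
the D2RQ or Virtuoso RDF Views mapping language, only an illustration of
projecting rows as triples on the fly instead of copying them into a
separate warehouse:

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical namespace, not from the benchmark


def rows_as_triples(conn, table, key):
    """Yield (subject, predicate, object) triples for each row of `table`.

    Nothing is materialized: the triples are generated while the SQL
    cursor is being read, so the relational data stays where it is.
    `table` and `key` are trusted identifiers in this sketch.
    """
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cursor.description]
    for row in cursor:
        record = dict(zip(columns, row))
        subject = f"{BASE}{table}/{record[key]}"
        for col, value in record.items():
            if col != key and value is not None:
                yield (subject, f"{BASE}vocab/{col}", value)


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (id TEXT PRIMARY KEY, label TEXT, producer TEXT);
    INSERT INTO product VALUES ('p1', 'Widget', 'Acme');
""")

triples = list(rows_as_triples(conn, "product", "id"))
for t in triples:
    print(t)
```

A real mapper would also rewrite incoming SPARQL into SQL so that the
RDBMS cost optimizer does the heavy lifting; the sketch only shows the
row-to-triple projection.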


And yes, I would also like to see Mulgara and other RDF Stores in the 
next round of benchmarks :-)

Kingsley
> -----Original Message-----
> From: public-lod-request@w3.org [mailto:public-lod-request@w3.org] On Behalf
> Of Kingsley Idehen
> Sent: Wednesday, 24 September 2008 20:57
> To: Paul Gearon
> Cc: semantic-web@w3.org; public-lod@w3.org
> Subject: Re: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena
> TDB, D2R Server, and MySQL
>
>
> Paul Gearon wrote:
>   
>> On Mon, Sep 22, 2008 at 3:47 AM, Eyal Oren <eyal@cs.vu.nl> wrote:
>>   
>>     
>>> On 09/19/08/09/08 23:12 +0200, Orri Erling wrote:
>>>     
>>>       
>>>>> Has there been any analysis on whether there is a *fundamental*
>>>>> reason for such performance difference? Or is it simply a question of
>>>>> "maturity"; in other words, relational db technology has been around for a
>>>>> very long time and is very mature, whereas RDF implementations are still
>>>>> quite recent, so this gap will surely narrow ...?
>>>>>
>>>> This is a very complex subject.  I will offer some analysis below, but
>>>> this I fear will only raise further questions.  This is not the end of the
>>>> road, far from it.
>>>>
>>> As far as I understand, another issue is relevant: this benchmark is
>>> somewhat unfair as the relational stores have one advantage compared to the
>>> native triple stores: the relational data structure is fixed (Products,
>>> Producers, Reviews, etc with given columns), while the triple representation
>>> is generic (arbitrary s,p,o).
>>>     
>>>       
>> This point has an effect on several levels.
>>
>> For instance, the flexibility afforded by triples means that objects
>> stored in this structure require processing just to piece them all
>> together, whereas the RDBMS has already encoded the structure into the
>> table. Ironically, this is exactly the reason we
>> (Tucana/Kowari/Mulgara) ended up building an RDF database instead of
>> building on top of an RDBMS: The flexibility in table structure was
>> less efficient than a system that just "knew" it only had to deal with
>> 3 columns. Obviously the shape of the data (among other things)
>> dictates which is the better type of storage to use.
>>
>> A related point is that processing RDF to create an object means you
>> have to move around a lot in the graph. This could mean a lot of
>> seeking on disk, while an RDBMS will usually find the entire object in
>> one place on the disk. And seeks kill performance.
>>
>> This leads to the operations used to build objects from an RDF store.
>> A single object often requires the traversal of several statements,
>> where the object of one statement becomes the subject of the next.
>> Since the tables are typically represented as
>> Subject/Predicate/Object, this means that the main table will be
>> "joined" against itself. Even RDBMSs are notorious for not doing this
>> efficiently.
>>
>> One of the problems with self-joins is that efficient operations like
>> merge-joins (when they can be identified) will still result in lots of
>> seeking, since simple iteration on both sides of the join means
>> seeking around in the same data. Of course, there ARE ways to optimize
>> some of this, but the various stores are only just starting to get to
>> these optimizations now.
>>
>> Relational databases suffer similar problems, but joins are usually
>> only required for complex structures between different tables, which
>> can be stored on different spindles. Contrast this to RDF, which needs
>> to do many of these joins for all but the simplest of data.
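
To make the self-join point above concrete, here is a small sketch using
Python's sqlite3 module. The "product" and "triples" tables and their
contents are invented for illustration and are not the benchmark's
schema. The same object is fetched once from a fixed-schema row and once
by joining a three-column triple table against itself:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Fixed relational layout: the whole object lives in one row.
    CREATE TABLE product (id TEXT PRIMARY KEY, label TEXT,
                          producer TEXT, country TEXT);
    INSERT INTO product VALUES ('p1', 'Widget', 'Acme', 'DE');

    -- Generic triple layout: the same facts spread over four statements.
    CREATE TABLE triples (s TEXT, p TEXT, o TEXT);
    INSERT INTO triples VALUES
        ('p1',    'label',    'Widget'),
        ('p1',    'producer', 'prod1'),
        ('prod1', 'name',     'Acme'),
        ('prod1', 'country',  'DE');
""")

# Relational: a single primary-key lookup returns the object.
relational_row = conn.execute(
    "SELECT label, producer, country FROM product WHERE id = ?", ("p1",)
).fetchone()

# Triples: the table is joined against itself three times. The object of
# the 'producer' statement (t2.o) becomes the subject of t3 and t4.
triple_row = conn.execute("""
    SELECT t1.o, t3.o, t4.o
    FROM triples AS t1
    JOIN triples AS t2 ON t2.s = t1.s AND t2.p = 'producer'
    JOIN triples AS t3 ON t3.s = t2.o AND t3.p = 'name'
    JOIN triples AS t4 ON t4.s = t2.o AND t4.p = 'country'
    WHERE t1.s = ? AND t1.p = 'label'
""", ("p1",)).fetchone()

print(relational_row)
print(triple_row)
```

Both queries return the same three values; the difference is the number
of passes over the same table needed to assemble them.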
>>
>>   
>>     
>>> One can question whether such flexibility is relevant in practice, and if
>>> so, one may try to extract such structured patterns from data on-the-fly.
>>> Still, it's important to note that we're comparing somewhat different things
>>> here between the relational and the triple representation of the benchmark.
>>>     
>>>       
>> This is why I think it is very important to consider the type of data
>> being stored before choosing the type of storage to use. For some
>> applications an RDBMS is going to win hands down every time. For other
>> applications, an RDF store is definitely the way to go. Understanding
>> the flexibility and performance constraints of each is important. This
>> kind of benchmarking helps with that. It also helps identify where RDF
>> databases need to pick up their act.
>>
>> Regards,
>> Paul Gearon
>>
>>
>>   
>>     
> Paul,
>
> You make valid points. The problem here is that the benchmark has been 
> released without enough clarity about its prime purpose. To even 
> compare RDF Quad Stores with an RDBMS engine when the schema is 
> Relational in itself is kinda twisted.
>
> The role of mappers (D2RQ & Virtuoso RDF Views), for instance, should 
> have been made much clearer, maybe in separate results tables. I say 
> this because these mappers offer different approaches to projecting 
> RDBMS based data in RDF Linked Data form, on the fly, and their purpose 
> in this benchmark is all about raw performance and scalability as it 
> relates to the following RDF Linked Data generation and deployment conditions:
>
> 1. Schema is Relational
> 2. RDF warehouse is impractical
>
> As I am sure you know, we could invert this whole benchmark "Open World" 
> style, and then bring RDBMS engines to their knees by incorporating 
> SPARQL query patterns comprised of ?p's and subclasses.
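
As a rough illustration of why ?p patterns are awkward for a fixed
schema, the sketch below asks "which property of which subject has the
value 'Acme'?" against both layouts. The tables and data are
hypothetical. With triples the predicate is just another column; with a
relational table the query needs one branch per column, generated here
from the schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (id TEXT PRIMARY KEY, label TEXT, producer TEXT);
    INSERT INTO product VALUES ('p1', 'Widget', 'Acme');

    CREATE TABLE triples (s TEXT, p TEXT, o TEXT);
    INSERT INTO triples VALUES
        ('p1', 'label',    'Widget'),
        ('p1', 'producer', 'Acme');
""")

# Triple layout: equivalent of the pattern  ?s ?p "Acme"
triple_hits = conn.execute(
    "SELECT s, p FROM triples WHERE o = ? ORDER BY p", ("Acme",)
).fetchall()

# Relational layout: the "predicate" is a column name, so every non-key
# column needs its own SELECT. Column names come from the schema itself
# (PRAGMA table_info), never from user input.
columns = [row[1] for row in conn.execute("PRAGMA table_info(product)")
           if row[1] != "id"]
union_sql = " UNION ALL ".join(
    f"SELECT id, '{col}' FROM product WHERE {col} = ?" for col in columns
)
relational_hits = conn.execute(union_sql, ["Acme"] * len(columns)).fetchall()

print(triple_hits)
print(relational_hits)
```

The relational query grows with every column and every table in the
schema, and subclass reasoning would add further branches on top.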
>
> To conclude, the quad store numbers should simply be a comparison of 
> the quad stores themselves, and not the quad stores vs the mappers or 
> native SQL. This clarification really needs to make its way into the 
> benchmark narrative.
>
>
>   


-- 


Regards,

Kingsley Idehen	      Weblog: http://www.openlinksw.com/blog/~kidehen
President & CEO 
OpenLink Software     Web: http://www.openlinksw.com
Received on Thursday, 25 September 2008 16:19:48 UTC