RE: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL from Orri Erling on 2008-09-29 (semantic-web@w3.org from September 2008)

From: Orri Erling <erling@xs4all.nl>
Date: Tue, 30 Sep 2008 00:16:02 +0200
To: "'Seaborne, Andy'" <andy.seaborne@hp.com>, "'Story Henry'" <henry.story@bblfish.net>
Cc: <semantic-web@w3.org>
Message-Id: <200809292216.m8TMGKBZ038045@smtp-vbr11.xs4all.nl>

From Henry Story:

>
>
> As a matter of interest, would it be possible to develop RDF stores
> that optimize the layout of the data by analyzing the queries to the
> database? A bit like a Java Just In Time compiler analyses the usage
> of the classes in order to decide how to optimize the compilation.

From Andy Seaborne:

On a similar note, by mining the query logs it would be possible to create
parameterised queries and associated plan fragments without the client
needing to notify the server of the templates.  Coupled with automatically
calculating possible materialized views or other layout optimizations, the
poor, overworked client application writer doesn't get brought into
optimizing the server.

        Andy

Orri here:

With the BSBM workload, using parametrized queries at small scale saves
roughly 1/3 of the execution time.  It is possible to remember query plans
and to notice when the same query text is submitted with changes only in
literal values.  If the first query ran quickly, one may presume the query
with substitutions will also run quickly.  There are of course exceptions,
but detecting these means running most of the optimizer cost model, which
eliminates any benefit from caching.
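
As a rough illustration of remembering plans by query text, here is a
minimal Python sketch.  It is not how any particular store does it.  The
names normalize, PlanCache and compile_plan are hypothetical, and the
regular expression is a simplification of real SPARQL tokenizing: it only
handles double-quoted strings and plain numbers, and it may misread a
comparison written without spaces (such as ?a<5) as the start of an IRI.

```python
import re

# Matches an IRI (kept as is), a double-quoted string literal, or a
# standalone number. A simplification, not the full SPARQL grammar.
_TOKEN = re.compile(r'<[^<>\s]*>|"(?:[^"\\]|\\.)*"|\b\d+(?:\.\d+)?\b')


def normalize(query):
    """Return (template, literals) with each literal replaced by $n."""
    literals = []

    def repl(match):
        token = match.group(0)
        if token.startswith("<"):
            return token  # IRIs are part of the query shape, not parameters
        literals.append(token)
        return "$%d" % len(literals)

    template = _TOKEN.sub(repl, query)
    # Collapse whitespace so that layout differences share one template.
    return " ".join(template.split()), literals


class PlanCache:
    """Caches one plan per query template (hypothetical helper)."""

    def __init__(self, compile_plan):
        self._compile = compile_plan  # assumed: the expensive optimizer call
        self._plans = {}
        self.hits = 0
        self.misses = 0

    def plan_for(self, query):
        template, literals = normalize(query)
        plan = self._plans.get(template)
        if plan is None:
            self.misses += 1
            plan = self._plans[template] = self._compile(template)
        else:
            self.hits += 1
        return plan, literals
```

Two queries that differ only in a FILTER constant then share one plan.
Note that the sketch reuses the plan blindly, which is exactly the case
where the exceptions mentioned above can hurt.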


The other optimizations suggested have a larger upside but are far harder.
I would say that if we have a predictable workload, then mapping
relational to RDF is a lot easier than expecting the DBMS to figure out
materialized views to do the same.  If we do not have a predictable
workload, then making too many materialized views based on transient usage
patterns is a large downside because it grows the database, so that less
of the working set fits in memory.  A random access from disk is about
5000 times slower than a random access in memory.  Plus there is a high
cost to making the views, thus a high penalty for a wrong guess.  And if it
is hard enough to figure out where a query plan goes wrong with a given
schema, it is harder still to figure it out with a schema that morphs by
itself.
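
To put a number on the working set argument, here is a back-of-envelope
Python sketch.  It assumes uniformly random access, so that the fraction
of accesses served from memory is simply memory size divided by database
size; real workloads have hot spots and will behave better than this.
The function name and the sizes used are made up for illustration; only
the factor of 5000 comes from the text above.

```python
def relative_access_cost(memory_gb, database_gb, disk_penalty=5000.0):
    """Average cost of a random access, in units of one in-memory access.

    Assumes uniform random access, so the hit ratio is memory / database.
    """
    hit = min(1.0, memory_gb / database_gb)
    return hit * 1.0 + (1.0 - hit) * disk_penalty


# Everything fits in memory: every access costs one unit.
print(relative_access_cost(16, 16))
# Views grow the database by 25%: one access in five now goes to disk.
print(relative_access_cost(16, 20))
```

Under these assumptions, growing the database from 16 to 20 GB with the
same 16 GB of memory takes the average access cost from 1 to roughly 1000
units, which is why speculative materialized views are a risky bet.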

In the RDB world, Oracle for example recommends saving optimizer statistics
from the test environment and using these in the production environment
just so the optimizer does not get creative.  Now this is the essence of
wisdom for OLTP, but we are not talking OLTP with RDF.

If there is a history of usage, this history is steady, and the DBA can
confirm it as being a representative sample, then automatic materializing
of joins is a real possibility.  Doing this spontaneously would lead to
erratic response times, though.  For anything online, the accent is more on
predictable throughput than peak throughput.

The BSBM query mix does lend itself quite well to automatic materialization,
but with such a workload one would not normally represent the data as RDF
to begin with, since it is so typically relational.  Publishing any
e-commerce database as RDF is of course good, but then mapping is the
simpler and more predictable route.

It is my feeling that RDF has a dual role:  1. Interchange format:  This is
like what XML does, except that RDF has more semantics and expressivity.
2. Database storage format for cases where data must be integrated and is
too heterogeneous to easily fall into one relational schema.  This is for
example the case in the open web conversation and social space.  The first
case is for mapping, the second for warehousing.  Aside from this, there is
potential for more expressive queries through the query language dealing
with inferencing, like subclass/subproperty/transitive etc.  These do not go
very well with SQL views.



Orri

Received on Monday, 29 September 2008 22:17:01 UTC