Role of RDF on the Web and within enterprise applications. was: AW: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL

Hi Orri,

> It is my feeling that RDF has a dual role:
> 1. Interchange format: This is like what XML does, except that RDF has
> more semantics and expressivity.
> 2. Database storage format for cases where data must be integrated and is
> too heterogeneous to easily fall into one relational schema. This is for
> example the case in the open web conversation and social space.
> The first case is for mapping, the second for warehousing.
> Aside from this, there is potential for more expressive queries through
> the query language dealing with inferencing, like
> subclass/subproperty/transitive etc. These do not go very well with SQL
> views.

I cannot agree more with what you say :-)

We are seeing the first RDF use case emerge within initiatives like the
Linking Open Data effort, where, besides being more expressive, RDF also
plays to its strength of providing data links between records in different
databases.

Talking with people from industry, I get the feeling that more and more
people also understand the second use case, and that RDF is increasingly
used as a technology for something like "poor man's data integration". You
don't have to spend a lot of time and money on designing a comprehensive
data warehouse. You just throw data with different schemata from different
sources together and instantly get the benefit that you can browse and query
the data and that you have proper provenance tracking (using Named Graphs).
Depending on how much data integration you need, you then start to apply
some identity resolution and schema mapping techniques. We have been talking
to some pharma and media companies that have been doing data warehousing for
years, and they all seem to be very interested in this quick and dirty
approach.
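
As a rough illustration of this "throw it together" approach, the sketch
below keeps each source's triples in its own named graph, so every answer
still carries its provenance. The QuadStore class, the graph URIs and the
property names are all invented for the example; a real store would offer
this through the GRAPH keyword in SPARQL rather than a Python method.

```python
from collections import namedtuple

# A quad is a triple plus the named graph (source) it came from.
Quad = namedtuple("Quad", "s p o g")


class QuadStore:
    """Toy in-memory quad store; hypothetical, for illustration only."""

    def __init__(self):
        self.quads = set()

    def load(self, graph, triples):
        # Data from each source goes into its own named graph, unchanged.
        for s, p, o in triples:
            self.quads.add(Quad(s, p, o, graph))

    def match(self, s=None, p=None, o=None):
        # None acts as a wildcard; results keep the graph name (provenance).
        return sorted(
            q for q in self.quads
            if (s is None or q.s == s)
            and (p is None or q.p == p)
            and (o is None or q.o == o)
        )


store = QuadStore()
# Two sources with different schemata (all names are invented).
store.load("http://example.org/graph/crm",
           [("ex:aspirin", "crm:productName", "Aspirin")])
store.load("http://example.org/graph/trials",
           [("ex:aspirin", "trial:label", "acetylsalicylic acid")])

# Query across both sources at once and see where each statement came from.
for q in store.match(s="ex:aspirin"):
    print(q.g, q.p, q.o)
```

Nothing is mapped or cleaned up front here; identity resolution and schema
mapping can be layered on later, once it is clear how much integration is
actually needed.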

For both use cases, inferencing is a nice add-on but not essential. Within
the first use case, inferencing usually does not work, as data published by
various autonomous sources tends to be too dirty for reasoning engines.

Cheers,

Chris


-----Original Message-----
From: semantic-web-request@w3.org [mailto:semantic-web-request@w3.org] On
Behalf Of Orri Erling
Sent: Tuesday, 30 September 2008 00:16
To: 'Seaborne, Andy'; 'Story Henry'
Cc: semantic-web@w3.org
Subject: RE: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena
TDB, D2R Server, and MySQL


From Henry Story:

> As a matter of interest, would it be possible to develop RDF stores
> that optimize the layout of the data by analyzing the queries to the
> database? A bit like a Java Just In Time compiler analyses the usage
> of the classes in order to decide how to optimize the compilation.

From Andy Seaborne:

On a similar note, by mining the query logs it would be possible to create
parameterised queries and associated plan fragments without the client
needing to notify the server of the templates.  Coupled with automatically
calculating possible materialized views or other layout optimizations, the
poor, overworked client application writer doesn't get brought into
optimizing the server.

        Andy

Orri here:

With the BSBM workload, using parametrized queries at a small scale saves
roughly 1/3 of the execution time.  It is possible to remember query plans
and to notice if the same query text is submitted with only changes in
literal values.  If the first query ran quickly, one may presume the query
with substitutions will also run quickly.  There are of course exceptions,
but detecting these would mean running most of the optimizer cost model,
which would eliminate any benefit from caching.
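
A minimal sketch of the "same text, different literals" check, not how any
particular store does it: literals are replaced by a placeholder and the
resulting template is the cache key.  The PlanCache class, the compile_plan
callback and the regular expression are assumptions made for the example; a
real implementation would work on the parsed query, not on raw text.

```python
import re

# Matches quoted string literals and plain numeric literals. This is a
# deliberate simplification: typed, language-tagged or escaped literals need
# a proper tokenizer, and digits that follow a "/" in an IRI are replaced too.
_LITERAL = re.compile(r'"[^"]*"|\b\d+(?:\.\d+)?\b')


def template_of(query_text):
    """Return (template, literals): the text with each literal replaced by '$'."""
    literals = _LITERAL.findall(query_text)
    template = _LITERAL.sub("$", query_text)
    return template, literals


class PlanCache:
    def __init__(self, compile_plan):
        self._compile = compile_plan  # hypothetical: the expensive optimizer call
        self._plans = {}
        self.hits = 0

    def plan_for(self, query_text):
        template, literals = template_of(query_text)
        if template in self._plans:
            self.hits += 1  # reuse the plan, skip the optimizer entirely
        else:
            self._plans[template] = self._compile(template)
        return self._plans[template], literals
```

Note that the cached plan is reused blindly: as said above, checking whether
the new literal values would call for a different plan costs about as much as
optimizing from scratch.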


The other optimizations suggested have a larger upside but are far harder.
I would say that if we have a predictable workload, then mapping
relational to RDF is a lot easier than expecting the DBMS to figure out
materialized views to do the same.  If we do not have a predictable
workload, then making too many materialized views based on transient usage
patterns is a large downside because it grows the database, meaning less of
the working set fits in memory.  The difference between a random access in
memory and a random access that goes to disk is about a factor of 5000.
Plus there is a high cost to making the views, thus a high penalty for a
wrong guess.  And if it is hard enough to figure out where a query plan goes
wrong with a given schema, it is harder still to figure it out with a schema
that morphs by itself.
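
To put a number on the working set point, here is a back-of-the-envelope
calculation using the factor of 5000 from above.  The hit ratios are made-up
sample values, and the model (every access is either a pure memory hit or a
full disk access) is a simplifying assumption.

```python
# Ratio taken from the text above: a random access that goes to disk costs
# about 5000 times as much as one served from memory.
DISK_PENALTY = 5000


def relative_cost(hit_ratio):
    """Average cost per random access, in units of one in-memory access."""
    return hit_ratio * 1 + (1 - hit_ratio) * DISK_PENALTY


# Sample hit ratios (assumed values, not measurements).
for hit_ratio in (1.0, 0.99, 0.95):
    print(f"{hit_ratio:.0%} in memory -> {relative_cost(hit_ratio):.1f}x")
```

Under this model, pushing just 1% of the random accesses out of memory makes
the average access roughly 51 times more expensive, which is why views that
grow the database can cost more than they save.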

In the RDB world, Oracle for example recommends saving optimizer statistics
from the test environment and using these in the production environment
just so the optimizer does not get creative.  Now this is the essence of
wisdom for OLTP, but we are not talking OLTP with RDF.

If there is a history of usage and this history is steady and the DBA can
confirm it as being a representative sample, then automatic materializing
of joins is a real possibility.  Doing this spontaneously would lead to
erratic response times, though.  For anything online, the accent is more on
predictable throughput than peak throughput.

The BSBM query mix does lend itself quite well to automatic materialization,
but with such a workload one would not normally represent the data as RDF to
begin with, the workload being so typically relational.  Publishing any
ecommerce database as RDF is of course good, but then mapping is the simpler
and more predictable route.

It is my feeling that RDF has a dual role:  1. Interchange format:  This is
like what XML does, except that RDF has more semantics and expressivity.  2.
Database storage format for cases where data must be integrated and is too
heterogeneous to easily fall into one relational schema.  This is for example
the case in the open web conversation and social space.  The first case is
for mapping, the second for warehousing.  Aside from this, there is potential
for more expressive queries through the query language dealing with
inferencing, like subclass/subproperty/transitive etc.  These do not go
very well with SQL views.
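
As a small illustration of the subclass case, the sketch below answers "all
instances of a class" by following subclass links transitively.  The class
hierarchy, the instance data and the function names are invented for the
example; this is only meant to show the kind of traversal that a fixed SQL
view does not express comfortably.

```python
def superclasses(cls, subclass_of):
    """All classes reachable from cls via subclass_of, including cls itself."""
    seen, stack = {cls}, [cls]
    while stack:
        for parent in subclass_of.get(stack.pop(), ()):
            if parent not in seen:  # also guards against cycles
                seen.add(parent)
                stack.append(parent)
    return seen


def instances_of(cls, types, subclass_of):
    """Instances whose asserted type is cls or any transitive subclass of it."""
    return sorted(x for x, t in types.items()
                  if cls in superclasses(t, subclass_of))


# Invented example data: a class hierarchy and asserted types.
subclass_of = {"ex:Laptop": ["ex:Computer"], "ex:Computer": ["ex:Product"]}
types = {"ex:item1": "ex:Laptop", "ex:item2": "ex:Computer",
         "ex:item3": "ex:Person"}

print(instances_of("ex:Product", types, subclass_of))
# prints ['ex:item1', 'ex:item2']
```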



Orri

Received on Tuesday, 30 September 2008 07:40:56 UTC