
Re: Role of RDF on the Web and within enterprise applications. was: AW: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL

From: Kingsley Idehen <kidehen@openlinksw.com>
Date: Tue, 30 Sep 2008 08:06:50 -0400
Message-ID: <48E2165A.80601@openlinksw.com>
To: Chris Bizer <chris@bizer.de>
CC: 'Orri Erling' <erling@xs4all.nl>, "'Seaborne, Andy'" <andy.seaborne@hp.com>, 'Story Henry' <henry.story@bblfish.net>, semantic-web@w3.org, public-lod@w3.org

Chris Bizer wrote:
> Hi Orri,
>
>> It is my feeling that RDF has a dual role:
>> 1. Interchange format: this is like what XML does, except that RDF has
>> more semantics and expressivity.
>> 2. Database storage format for cases where data must be integrated and
>> is too heterogeneous to easily fall into one relational schema. This is
>> for example the case in the open web conversation and social space. The
>> first case is for mapping, the second for warehousing. Aside from this,
>> there is potential for more expressive queries through the query
>> language dealing with inferencing, like subclass/subproperty/transitive
>> etc. These do not go very well with SQL views.
>
> I cannot agree more with what you say :-)
>
> We are seeing the first RDF use case emerge within initiatives like the
> Linking Open Data effort, where besides being more expressive, RDF is
> also playing to its strength of providing data links between records in
> different databases.
>

Chris,
> Talking with people from industry, I get the feeling that more and more
> people also understand the second use case and that RDF is increasingly
> used as a technology for something like "poor man's data integration".
> You don't have to spend a lot of time and money on designing a
> comprehensive data warehouse. You just throw data having different
> schemata from different sources together and instantly get the benefit
> that you can browse and query the data, and that you have proper
> provenance tracking (using Named Graphs). Depending on how much data
> integration you need, you then start to apply some identity resolution
> and schema mapping techniques. We have been talking to some pharma and
> media companies that have done data warehousing for years, and they all
> seem to be very interested in this quick and dirty approach.
>
"Quick & Dirty" is simply not how I would characterize this matter. I 
prefer to describe this as step 1 in a multi phased approach to RDF 
based data integration.
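
To make the Named Graphs point above concrete, here is a minimal sketch
in Python with rdflib; the graph URIs, vocabulary, and triples are all
invented for illustration:

from rdflib import ConjunctiveGraph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/vocab/")

store = ConjunctiveGraph()
# Each source's data lands in its own named graph, so provenance is
# preserved even though the two schemata differ.
crm = store.get_context(URIRef("http://example.org/graphs/crm-dump"))
crm.add((EX.cust42, EX.name, Literal("Acme GmbH")))
erp = store.get_context(URIRef("http://example.org/graphs/erp-dump"))
erp.add((EX.acct7, EX.companyName, Literal("Acme GmbH")))

# Query across both sources and recover which graph each fact came from.
q = "SELECT ?g ?s ?p ?o WHERE { GRAPH ?g { ?s ?p ?o } }"
for g, s, p, o in store.query(q):
    print(g, s, p, o)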
> For both use cases, inferencing is a nice add-on but not essential. Within
> the first use case, inferencing usually does not work, as data published by
> various autonomous sources tends to be too dirty for reasoning engines.
>
Inferencing is not a nice add-on; it is essential (in varying degrees)
once you get beyond the initial stages of heterogeneous data
integration. As with all things, these matters are connected and
inherently symbiotic: you can't do inferencing without having something
you want to reason about available in palatable form, which goes back to
the phased approach I refer to above.
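
As a small illustration of the subclass/subproperty point, here is a
sketch of query-time subclass reasoning using a SPARQL 1.1 property path
in rdflib; the class hierarchy is invented, and this is exactly the kind
of query that does not map well onto SQL views:

from rdflib import Graph, Namespace, RDF, RDFS

EX = Namespace("http://example.org/vocab/")
g = Graph()
g.add((EX.Manager, RDFS.subClassOf, EX.Employee))
g.add((EX.Employee, RDFS.subClassOf, EX.Person))
g.add((EX.alice, RDF.type, EX.Manager))

# Walk rdfs:subClassOf transitively at query time: alice is found to be
# a Person even though no triple says so directly.
q = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?x WHERE { ?x rdf:type/rdfs:subClassOf* <http://example.org/vocab/Person> }
"""
print([row.x for row in g.query(q)])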

In my eyes, and in my experience, RDF is a powerful vehicle for
implementing conceptual-level data access that sits atop heterogeneous
data sources. Its novelty comes from the platform independence that it
injects into the data integration technology realm.


Kingsley
> Cheers,
>
> Chris
>
>
> -----Original Message-----
> From: semantic-web-request@w3.org [mailto:semantic-web-request@w3.org] On
> Behalf Of Orri Erling
> Sent: Tuesday, 30 September 2008 00:16
> To: 'Seaborne, Andy'; 'Story Henry'
> Cc: semantic-web@w3.org
> Subject: RE: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena
> TDB, D2R Server, and MySQL
>
>
> From Henry Story:
>
>   
>> As a matter of interest, would it be possible to develop RDF stores
>> that optimize the layout of the data by analyzing the queries to the
>> database? A bit like a Java Just In Time compiler analyses the usage
>> of the classes in order to decide how to optimize the compilation.
>>     
>
> From Andy Seaborne:
>
> On a similar note, by mining the query logs it would be possible to create
> parameterised queries and associated plan fragments without the client
> needing to notify the server of the templates. Coupled with automatically
> calculating possible materialized views or other layout optimizations, the
> poor, overworked client application writer isn't dragged into optimizing
> the server.
>
>         Andy
>
>   
>  
> Orri here:
>
> With the BSBM workload, using parametrized queries at a small scale saves
> roughly 1/3 of the execution time. It is possible to remember query plans
> and to notice when the same query text is submitted with only changes in
> literal values. If the first query ran quickly, one may presume the query
> with substitutions will also run quickly. There are of course exceptions,
> but detecting these would mean running most of the optimizer cost model
> and would eliminate any benefit from caching.
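
A rough sketch of this plan-caching idea: normalize the query text by
replacing literal values with placeholders, then use the resulting
template as the cache key. The regexes below are illustrative only; a
real implementation would tokenize the query properly.

import re

def template_of(query: str) -> str:
    # Queries that differ only in literal values share one cache key.
    q = re.sub(r'"[^"]*"', '"?"', query)  # string literals
    q = re.sub(r'\b\d+\b', '?', q)        # bare numeric literals
    return q

plan_cache = {}

def plan_for(query: str) -> str:
    key = template_of(query)
    if key not in plan_cache:
        # Stand-in for the optimizer; building the plan is the expensive
        # step that the cache is meant to avoid repeating.
        plan_cache[key] = "PLAN for " + key
    return plan_cache[key]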
>
>
> The other optimizations suggested have a larger upside but are far harder.
> I would say that if we have a predictable workload, then mapping
> relational to RDF is a lot easier than expecting the DBMS to figure out
> materialized views to do the same. If we do not have a predictable
> workload, then making too many materialized views based on transient usage
> patterns has a large downside because it grows the database, meaning less
> of the working set fits in memory. The difference between an in-memory
> random access and a random access involving disk is about 5000 times.
> Plus there is a high cost to making the views, thus a high penalty for a
> wrong guess. And if it is hard enough to figure out where a query plan
> goes wrong with a given schema, it is harder still to figure it out with
> a schema that morphs by itself.
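
The "about 5000 times" figure is easy to sanity-check with assumed
latencies, roughly typical of 2008-era hardware; the exact numbers are
illustrative, not measurements from the benchmark:

ram_ns = 100             # ~100 ns per in-memory random access
disk_ns = 500_000        # ~0.5 ms per random disk access (seek + rotation)
print(disk_ns / ram_ns)  # 5000.0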
>
> In the RDB world, Oracle, for example, recommends saving optimizer
> statistics from the test environment and using these in the production
> environment just so the optimizer does not get creative. Now this is the
> essence of wisdom for OLTP, but we are not talking OLTP with RDF.
>
> If there is a history of usage, and this history is steady and the DBA can
> confirm it as being a representative sample, then automatic materializing
> of joins is a real possibility. Doing this spontaneously would lead to
> erratic response times, though. For anything online, the accent is more on
> predictable throughput than peak throughput.
>
> The BSBM query mix lends itself quite well to automatic materialization,
> but with this one would not normally do the representation as RDF to begin
> with, the workload being so typically relational. Publishing any e-commerce
> database as RDF is of course good, but then mapping is the simpler and more
> predictable route.
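
At its simplest, the mapping route can look like the following sketch:
rows from a hypothetical product table are turned into triples on demand
with rdflib, so no RDF warehouse is materialized. All table, column, and
vocabulary names here are invented:

from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/vocab/")
rows = [(1, "Widget", 9.99), (2, "Gadget", 19.50)]  # stand-in for a SQL result set

g = Graph()
for pid, label, price in rows:
    product = URIRef("http://example.org/product/%d" % pid)
    g.add((product, EX.label, Literal(label)))
    g.add((product, EX.price, Literal(price)))

print(g.serialize(format="turtle"))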
>
> It is my feeling that RDF has a dual role:
> 1. Interchange format: this is like what XML does, except that RDF has
> more semantics and expressivity.
> 2. Database storage format for cases where data must be integrated and is
> too heterogeneous to easily fall into one relational schema. This is for
> example the case in the open web conversation and social space.
> The first case is for mapping, the second for warehousing. Aside from
> this, there is potential for more expressive queries through the query
> language dealing with inferencing, like subclass/subproperty/transitive
> etc. These do not go very well with SQL views.
>
>
>
> Orri


-- 


Regards,

Kingsley Idehen	      Weblog: http://www.openlinksw.com/blog/~kidehen
President & CEO 
OpenLink Software     Web: http://www.openlinksw.com
Received on Tuesday, 30 September 2008 12:07:37 GMT
