- From: Chris Bizer <chris@bizer.de>
- Date: Tue, 30 Sep 2008 09:40:13 +0200
- To: "'Orri Erling'" <erling@xs4all.nl>, "'Seaborne, Andy'" <andy.seaborne@hp.com>, "'Story Henry'" <henry.story@bblfish.net>
- Cc: <semantic-web@w3.org>, <public-lod@w3.org>
Hi Orri,

> It is my feeling that RDF has a dual role:
> 1. interchange format: This is like what XML does, except that RDF has more semantics and expressivity.
> 2. Database storage format for cases where data must be integrated and is too heterogeneous to easily
> fall into one relational schema. This is for example the case in the open web conversation and social
> space. The first case is for mapping, the second for warehousing.
> Aside from this, there is potential for more expressive queries through the query language dealing with
> inferencing, like subclass/subproperty/transitive etc. These do not go very well with SQL views.

I cannot agree more with what you say :-)

We are seeing the first RDF use case emerge within initiatives like the Linking Open Data effort, where besides being more expressive, RDF is also playing to its strength of providing data links between records in different databases.

Talking with people from industry, I get the feeling that more and more people also understand the second use case and that RDF is increasingly used as a technology for something like "poor man's data integration". You don't have to spend a lot of time and money on designing a comprehensive data warehouse. You just throw data having different schemata from different sources together and instantly get the benefit that you can browse and query the data and that you have proper provenance tracking (using Named Graphs). Depending on how much data integration you need, you then start to apply some identity resolution and schema mapping techniques. We have been talking to some pharma and media companies that have been doing data warehousing for years and they all seem to be very interested in this quick and dirty approach.

For both use cases, inferencing is a nice add-on but not essential. Within the first use case, inferencing usually does not work as data published by various autonomous sources tends to be too dirty for reasoning engines.
Cheers,

Chris

-----Original Message-----
From: semantic-web-request@w3.org [mailto:semantic-web-request@w3.org] On Behalf Of Orri Erling
Sent: Tuesday, 30 September 2008 00:16
To: 'Seaborne, Andy'; 'Story Henry'
Cc: semantic-web@w3.org
Subject: RE: Berlin SPARQL Benchmark V2 - Results for Sesame, Virtuoso, Jena TDB, D2R Server, and MySQL

From Henry Story:

> As a matter of interest, would it be possible to develop RDF stores
> that optimize the layout of the data by analyzing the queries to the
> database? A bit like a Java Just In Time compiler analyses the usage
> of the classes in order to decide how to optimize the compilation.

From Andy Seaborne:

On a similar note, by mining the query logs it would be possible to create parameterised queries and associated plan fragments without the client needing to notify the server of the templates. Coupled with automatically calculating possible materialized views or other layout optimizations, the poor, overworked client application writer doesn't get brought into optimizing the server.

Andy

Orri here:

With the BSBM workload, using parametrized queries at a small scale saves roughly 1/3 of the execution time. It is possible to remember query plans and to notice if the same query text is submitted with only changes in literal values. If the first query ran quickly, one may presume the query with substitutions will also run quickly. There are of course exceptions. But detecting these will mean running most of the optimizer cost model and will eliminate any benefit from caching.

The other optimizations suggested have a larger upside but are far harder. I would say that if we have a predictable workload, then mapping relational to RDF is a lot easier than expecting the DBMS to figure out materialized views to do the same.
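
[Editor's sketch, not part of the original message.] The idea of remembering a plan when the same query text comes back with only the literal values changed could look roughly like the following. Everything here is an assumption made for illustration: the `normalize` helper, the `PlanCache` class, the placeholder naming, and the regular expression, which only recognises double-quoted strings and bare numbers. It does not describe how Virtuoso or any other store implements plan caching.

```python
import re

# Hypothetical literal pattern: double-quoted strings and bare numbers only.
# A real store would use its SPARQL parser; digits inside IRIs would be
# mis-detected by this simple regex.
_LITERAL = re.compile(r'"(?:[^"\\]|\\.)*"|\b\d+(?:\.\d+)?\b')


def normalize(query):
    """Replace each literal with a numbered placeholder.

    Returns the query template and the list of extracted literals.
    """
    literals = []

    def repl(match):
        literals.append(match.group(0))
        return "?__p%d" % (len(literals) - 1)

    return _LITERAL.sub(repl, query), literals


class PlanCache:
    """Cache plans keyed by query template (query text minus literals)."""

    def __init__(self, planner):
        self._planner = planner  # callable: template -> plan (assumed)
        self._plans = {}
        self.hits = 0
        self.misses = 0

    def plan_for(self, query):
        template, literals = normalize(query)
        plan = self._plans.get(template)
        if plan is None:
            # First time this shape is seen: pay for the optimizer once.
            self.misses += 1
            plan = self._planner(template)
            self._plans[template] = plan
        else:
            # Same shape, different literals: reuse the plan unchecked.
            self.hits += 1
        return plan, literals
```

Note that the cache reuses a plan without re-checking whether the new literal values change selectivity, which is exactly the exception mentioned above: detecting it would require running most of the cost model again.
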
If we do not have a predictable workload, then making too many materialized views based on transient usage patterns is a large downside because it grows the database, meaning less working set. The difference between an in-memory random access and a random access to disk is a factor of about 5000. Plus there is a high cost to making the views, thus a high penalty for a wrong guess. And if it is hard enough to figure out where a query plan goes wrong with a given schema, it is harder still to figure it out with a schema that morphs by itself.

In the RDB world, for example, Oracle recommends saving optimizer statistics from the test environment and using these in the production environment just so the optimizer does not get creative. Now this is the essence of wisdom for OLTP but we are not talking OLTP with RDF.

If there is a history of usage and this history is steady and the DBA can confirm it as being a representative sample, then automatic materializing of joins is a real possibility. Doing this spontaneously would lead to erratic response times, though. For anything online, the accent is more on predictable throughput than peak throughput.

The BSBM query mix does lend itself quite well to automatic materialization but with this, one would not normally do the representation as RDF to begin with, the workload being so typically relational. Publishing any ecommerce database as RDF is of course good but then mapping is the simpler and more predictable route.

It is my feeling that RDF has a dual role:

1. Interchange format: This is like what XML does, except that RDF has more semantics and expressivity.
2. Database storage format for cases where data must be integrated and is too heterogeneous to easily fall into one relational schema. This is for example the case in the open web conversation and social space.

The first case is for mapping, the second for warehousing.

Aside from this, there is potential for more expressive queries through the query language dealing with inferencing, like subclass/subproperty/transitive etc. These do not go very well with SQL views.

Orri
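
[Editor's sketch, not part of the original message.] As a rough illustration of the subclass/transitive inferencing mentioned in the closing paragraph, the following walks `rdfs:subClassOf` edges transitively so that a query for a class also returns instances of its subclasses. The triples are plain Python tuples, and the function names and `ex:` identifiers are made up for the example; this is not how any particular store evaluates such queries.

```python
from collections import defaultdict

SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def superclasses(triples, cls):
    """Return every class reachable from cls via rdfs:subClassOf."""
    parents = defaultdict(set)
    for s, p, o in triples:
        if p == SUBCLASS:
            parents[s].add(o)

    seen, stack = set(), [cls]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:  # also guards against subclass cycles
                seen.add(parent)
                stack.append(parent)
    return seen


def instances_of(triples, cls):
    """Return subjects typed as cls or as any transitive subclass of cls."""
    return {
        s
        for s, p, o in triples
        if p == TYPE and (o == cls or cls in superclasses(triples, o))
    }


# Hypothetical data: a three-level class hierarchy and two typed items.
triples = [
    ("ex:Laptop", SUBCLASS, "ex:Computer"),
    ("ex:Computer", SUBCLASS, "ex:Product"),
    ("ex:item1", TYPE, "ex:Laptop"),
    ("ex:item2", TYPE, "ex:Product"),
]
```

With this data, `instances_of(triples, "ex:Product")` includes `ex:item1` even though it is only declared as an `ex:Laptop`. The depth of the hierarchy is not known in advance, which is what makes this awkward to express as a fixed SQL view.
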
Received on Tuesday, 30 September 2008 07:41:04 UTC