- From: Orri Erling <erling@xs4all.nl>
- Date: Tue, 30 Sep 2008 00:16:02 +0200
- To: "'Seaborne, Andy'" <andy.seaborne@hp.com>, "'Story Henry'" <henry.story@bblfish.net>
- Cc: <semantic-web@w3.org>
From Henry Story:

> As a matter of interest, would it be possible to develop RDF stores
> that optimize the layout of the data by analyzing the queries to the
> database? A bit like a Java Just In Time compiler analyses the usage
> of the classes in order to decide how to optimize the compilation.

From Andy Seaborne:

> On a similar note, by mining the query logs it would be possible to
> create parameterised queries and associated plan fragments without the
> client needing to notify the server of the templates. Coupled with
> automatically calculating possible materialized views or other layout
> optimizations, the poor, overworked client application writer doesn't
> get brought into optimizing the server.
>
> Andy

Orri here:

With the BSBM workload, using parametrized queries at a small scale saves roughly 1/3 of the execution time. It is possible to remember query plans and to notice when the same query text is submitted with only changes in literal values. If the first query ran quickly, one may presume the query with substitutions will also run quickly. There are of course exceptions, but detecting these would mean running most of the optimizer cost model and would eliminate any benefit from caching.

The other optimizations suggested have a larger upside but are far harder. I would say that if we have a predictable workload, then mapping relational to RDF is a lot easier than expecting the DBMS to figure out materialized views to do the same. If we do not have a predictable workload, then making too many materialized views based on transient usage patterns has a large downside because it grows the database, which means a smaller working set. The difference between an in-memory random access and a random access that goes to disk is about 5000-fold. Plus there is a high cost to making the views, thus a high penalty for a wrong guess. And if it is hard enough to figure out where a query plan goes wrong with a given schema, it is harder still to figure it out with a schema that morphs by itself. In the RDB world, Oracle for example recommends saving optimizer statistics from the test environment and using these in the production environment just so the optimizer does not get creative. Now this is the essence of wisdom for OLTP, but we are not talking OLTP with RDF.

If there is a history of usage, and this history is steady and the DBA can confirm it as being a representative sample, then automatic materializing of joins is a real possibility. Doing this spontaneously would lead to erratic response times, though. For anything online, the accent is more on predictable throughput than on peak throughput.

The BSBM query mix does lend itself quite well to automatic materialization, but then one would not normally represent the data as RDF to begin with, the workload being so typically relational. Publishing any e-commerce database as RDF is of course good, but then mapping is the simpler and more predictable route.

It is my feeling that RDF has a dual role:

1. Interchange format: this is like what XML does, except that RDF has more semantics and expressivity.

2. Database storage format for cases where data must be integrated and is too heterogeneous to easily fall into one relational schema. This is for example the case in the open web conversation and social space.

The first case is for mapping, the second for warehousing. Aside from this, there is potential for more expressive queries through the query language dealing with inferencing, like subclass/subproperty/transitive etc. These do not go very well with SQL views.

Orri
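To make the plan-reuse point above concrete, here is a minimal sketch of keying a plan cache on the query text with its literal values blanked out. The regex-based normalization and the compile_query hook are purely illustrative assumptions, not any particular store's API; a real store would normalize through its parser rather than with regexes.

    import re

    # Hypothetical, regex-based literal stripping; a real store would use its parser.
    _STRING_LIT = re.compile(r'"[^"]*"')
    _NUMBER_LIT = re.compile(r'\b\d+(?:\.\d+)?\b')

    plan_cache = {}  # normalized query text -> previously compiled plan

    def normalize(query_text):
        """Blank out literal values so queries differing only in literals share a key."""
        text = _STRING_LIT.sub('?', query_text)
        return _NUMBER_LIT.sub('?', text)

    def get_plan(query_text, compile_query):
        """Reuse a cached plan when only the literals changed; otherwise compile.

        compile_query stands in for the full optimizer run whose cost this
        scheme tries to avoid.
        """
        key = normalize(query_text)
        plan = plan_cache.get(key)
        if plan is None:
            plan = compile_query(query_text)
            plan_cache[key] = plan
        return plan

The caveat stated above applies: a plan compiled for one literal may be a poor plan for a literal with very different selectivity, and catching such cases would mean re-running most of the cost model, which erases the benefit of the cache.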
Received on Monday, 29 September 2008 22:17:01 UTC