- From: Kjetil Kjernsmo <kjetil@kjernsmo.net>
- Date: Mon, 24 Nov 2014 09:54:25 +0100
- To: public-hydra@w3.org
- Message-ID: <3077414.jpBD7z2ARa@owl>
On Sunday 23. November 2014 10.56.24 Ruben Verborgh wrote:
> It really depends on the update frequency of the datasets.
> Some of the most referenced datasets in the SemWeb are static,
> like the various DBpedia versions we all know very well,
> and those never change once created (not talking about Live).
> Hence, we are able to host them through the HDT compressed triple
> format, which gives excellent performance for those cases,
> far better than what I've seen any DBMS do.

Well, while you are right in that you post some impressive figures, I'm not
so sure that's not mainly an artifact of the benchmarks. :-) You'd expect
HDT to be fast for unbound subjects.

Fast backends are nice, but SPARQLES now sports 80 ms response times for
its SPARQL queries against my pure Perl endpoint:

http://sparqles.okfn.org/endpoint?uri=http%3A%2F%2Fdata.lenka.no%2Fsparql

but that's obviously because the query never goes all the way down to the
SPARQL endpoint; it is served from DDR2 main memory by my Celeron-powered
Varnish cache. :-) I suspect you would need materialization to achieve
similar numbers, which divides the problem space into two parts: the
queries that you know people will run, and the ones that you don't.

Update frequencies, if we are talking several seconds, are not all that
important: if you get thousands of identical queries within 10 seconds,
caching is still of paramount importance.

So, the question is how you do materialization for the queries that you do
know people will run. In the vast majority of cases, the way you solve this
is by having a list of URLs on a host outside of your cache, and signalling
wget to visit this list when you update. Perhaps you need something
slightly more sophisticated if you update just parts of the dataset, but
this is two one-liners; it couldn't be done any simpler (see the sketch
below my signature). In this case, the backend performance isn't all that
important, as the above example shows; so, you might claim that you can do
it faster than any DBMS, but that'd be premature optimization, IMHO.

Responding to arbitrary queries that are hard to predict is a different
story, but there I'm not quite ready to concede that the DBMS isn't nice
to have just yet, also for performance reasons. :-)

Cheers,

Kjetil
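PS: For concreteness, a minimal sketch of what the two one-liners could
look like, assuming a Varnish cache in front of the endpoint, a plain-text
list of URLs (urls.txt is a hypothetical name, one URL per line), and a
VCL that accepts the PURGE method:

  # 1) Purge stale entries -- only needed if parts of the dataset changed;
  #    assumes your Varnish VCL handles PURGE requests:
  while read url; do curl -s -X PURGE "$url" > /dev/null; done < urls.txt

  # 2) Re-warm the cache by visiting every URL in the list:
  wget -q --input-file=urls.txt --output-document=/dev/null

Run from cron or a post-update hook on a host outside the cache, so the
requests actually pass through Varnish and repopulate it.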
Received on Monday, 24 November 2014 08:55:43 UTC