Re: LOD Instance Update re. new Data Sets Loaded. from Yrjänä Rankka on 2009-03-06 (public-lod@w3.org from March 2009)

From: Yrjänä Rankka <ghard@openlinksw.com>
Date: Fri, 06 Mar 2009 11:06:33 +0100
To: Dan Brickley <danbri@danbri.org>
CC: Georgi Kobilarov <georgi.kobilarov@gmx.de>, Kingsley Idehen <kidehen@openlinksw.com>, Linked Data community <public-lod@w3.org>
Message-ID: <49B0F5A9.8080006@openlinksw.com>

Dan Brickley wrote:
> On 6/3/09 10:21, Yrjänä Rankka wrote:
>> Georgi Kobilarov wrote:
>>> Hi Kingsley,
>>>
>>> DESCRIBE <http://dbpedia.org/resource/London> takes 3 minutes to
>>> execute on lod.openlinksw.com ...
>>>
>> It took only a few seconds when I tried it. Takes time to warm up a pan
>> of this size, as is the case with any DBMS. As the working set
>> stabilizes in memory, results will come faster.
>
> What's the granularity of the warmup? If eg /resource/Paris hasn't 
> been directly viewed, will it benefit much from general warmup of 
> related resources that are mentioned in the queries for that entity?
>
Very likely so. Also, in the case of DESCRIBE
<http://dbpedia.org/resource/London>, the ~13MB result takes a while
to transfer as well, though not quite 3 minutes, at least not through
the pipe I'm connected to.

Here's the explanation of how the read-ahead works straight from the 
horse's mouth:

In general, looking for resources in a data set improves the working set 
for that data set.  There is some locality based on load order etc.

The disk format is 8K pages, 256 pages per extent of 2MB. 

There are 8 disks and 16 server processes, so disk is the bottleneck.
Disk reads are in general done in parallel on all disks.

The random access transfer unit is 8K but if you get two reads hitting 
the same extent within a second of each other, the whole extent is read 
sequentially instead of the second single-page request.
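
To make the rule concrete, here is a minimal Python sketch of it. It is
not Virtuoso code: the class name ReadAheadPlanner, the method plan_read
and the bookkeeping are made up for illustration. Only the 8K page, the
256-page extent and the one-second window come from the description above.

```python
PAGE_SIZE = 8 * 1024                         # 8K pages
PAGES_PER_EXTENT = 256                       # 256 pages per extent
EXTENT_SIZE = PAGE_SIZE * PAGES_PER_EXTENT   # 2MB
WINDOW_SECONDS = 1.0                         # "within a second of each other"


class ReadAheadPlanner:
    """Hypothetical planner: decides between a single page read and a
    sequential read of the whole extent."""

    def __init__(self):
        # extent number -> time of the most recent read that hit it
        self.last_hit = {}

    def plan_read(self, page_no, now):
        """Return (kind, first_page, page_count) for a request at time `now`."""
        extent = page_no // PAGES_PER_EXTENT
        previous = self.last_hit.get(extent)
        self.last_hit[extent] = now
        if previous is not None and now - previous <= WINDOW_SECONDS:
            # Second hit on the same extent inside the window:
            # read the whole extent sequentially instead of one page.
            return ("extent", extent * PAGES_PER_EXTENT, PAGES_PER_EXTENT)
        # Otherwise fall back to the 8K random access transfer unit.
        return ("page", page_no, 1)
```

Two requests for pages 10 and 200 arriving half a second apart both land
in extent 0, so the second one is planned as a read of all 256 pages.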

So frequency of access drives bulk prefetching.  Then there are cache
maintenance policies that differ between just prefetched and actually
requested pages.  This is a tunable tradeoff between disk throughput and 
cache pollution.
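
One way to picture such a tradeoff is sketched below in Python. This is
an assumption for illustration only, not a description of the actual
Virtuoso policy: the class PageCache and the knob prefetch_share are
hypothetical. Pages that were only prefetched are capped at a share of
the cache and are evicted first; a page that is actually requested is
promoted to an ordinary LRU list.

```python
from collections import OrderedDict


class PageCache:
    """Hypothetical cache that treats prefetched and requested pages differently."""

    def __init__(self, capacity, prefetch_share):
        self.capacity = capacity
        # Tunable: at most this many slots for not-yet-requested prefetched pages.
        self.prefetch_cap = int(capacity * prefetch_share)
        self.requested = OrderedDict()   # LRU order, oldest first
        self.prefetched = OrderedDict()  # FIFO order, oldest first

    def _evict(self):
        # Over capacity: drop prefetched pages before requested ones.
        while len(self.requested) + len(self.prefetched) > self.capacity:
            victims = self.prefetched if self.prefetched else self.requested
            victims.popitem(last=False)

    def add_prefetched(self, page_no):
        """Record a page brought in by read-ahead, not by a request."""
        if page_no in self.requested or page_no in self.prefetched:
            return
        self.prefetched[page_no] = True
        while len(self.prefetched) > self.prefetch_cap:
            self.prefetched.popitem(last=False)
        self._evict()

    def request(self, page_no):
        """Record an actual request; return True on a cache hit."""
        if page_no in self.prefetched:
            del self.prefetched[page_no]
            self.requested[page_no] = True   # promote to the LRU list
            hit = True
        elif page_no in self.requested:
            self.requested.move_to_end(page_no)
            hit = True
        else:
            self.requested[page_no] = True
            hit = False
        self._evict()
        return hit
```

A larger prefetch_share lets more read-ahead pages survive until they are
needed, which favours disk throughput; a smaller one protects the pages
that queries have actually touched from being pushed out.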

Virtuoso IO is clever enough.  But the fact is that running from memory 
is 1000+ times faster than from disk on a random access workload and RDF 
is the very essence of random access.

> cheers
>
> Dan
>
Yrjänä

-- 
Yrjana Rankka            | ghard@openlinksw.com
Developer, Virtuoso Team | http://www.openlinksw.com
                         | Making Technology Work For You

Received on Friday, 6 March 2009 10:07:21 UTC