Re: Attempt to solve the job runner situation from Ryan Lane on 2013-11-06 (public-webplatform@w3.org from November 2013)

From: Ryan Lane <rlane32@gmail.com>
Date: Wed, 6 Nov 2013 01:18:46 -0600
To: Renoir Boulanger <renoir@w3.org>
Cc: List WebPlatform public <public-webplatform@w3.org>
Message-ID: <CALKgCA3WwYX+chnib1ZWS9WGN=nfebGatN1ujn+_BpWL=5pTDA@mail.gmail.com>

Also note the Ganglia graph for db1:

http://monitor.webplatform.org/ganglia/?r=week&c=db&h=db1.webplatform.org&mc=2

The problems started with the database issues we had earlier and stopped
when I fixed the SMW data.


On Wed, Nov 6, 2013 at 1:13 AM, Ryan Lane <rlane32@gmail.com> wrote:

> On Wed, Nov 6, 2013 at 12:30 AM, Renoir Boulanger <renoir@w3.org> wrote:
>
>> Hi all,
>>
>> Tonight, from 23:00 to 00:30 EST, Ryan and I had the chance to work on
>> the job runner situation [0].
>>
>> The problem was twofold. Part of it was a high volume of database writes
>> that was filling the database server's storage partitions. The main cause
>> was an ever-increasing number of jobs being added to MediaWiki's job
>> runner system. Details of the issue are described in [0].
>>
>> Symptoms of the problem were described in these e-mails: [1][2][3]
>>
>> Ryan decided to truncate the SemanticMediaWiki data and delete all
>> SemanticMediaWiki-related jobs.
>>
>> We hope this resolves the situation.
>>
>>
> The problem is indeed solved. I believe the SMW data was corrupted,
> causing SMW to continuously add jobs to the queue to fix its data. I
> truncated the job table, then refreshed the SMW data using a maintenance
> script. The job queue is now at 0.
>
> - Ryan
>

Received on Wednesday, 6 November 2013 07:19:33 UTC