Re: Making Linkchecker work in parallel

On 27 Jun 2005, at 13:30, olivier Thereaux wrote:
> On 22 Jun 2005, at 22:56, Dominique Hazaël-Massieux wrote:
>
>> I've had a quick look at the linkchecker to see what would be  
>> needed to
>> make it multithreaded; I see the linkchecker is using
>> LWP::RobotUA. Has
>> any thought been put into using LWP::Parallel::RobotUA [1] instead?
>>
>
> This is a really good idea, thanks! As Ville said, some such ideas  
> have been thrown around with the same goal, but our best bet so far  
> was to try and have parallel RobotUA instances, which would have  
> been problematic in many ways.
>
> This looks promising, as it would certainly remove some of the
> implementation concerns. Instead of having to track everything
> ourselves, it seems that this LWP::Parallel::RobotUA can be given
> new documents to process at any time (by 'registering' new
> requests); you then wait for some time and fetch the results.

I had a further look at it today. The changes to make it work are
minimal; the changes to make it work efficiently are, I think, small.

At the moment, checklink proceeds this way:

GET the initial document --> sub get_document --> sub get_uri
parse the document, extract links and anchors
foreach link { --> sub check_validity
    HEAD (or GET if there are anchors at play) link --> sub get_uri
    display result; store;
}
proceed to next document
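
In LWP::RobotUA terms, that inner loop amounts to something like the
following (a simplified, untested sketch; the sub and variable names,
and the %anchors lookup, are illustrative, not the actual checklink
code):

use LWP::RobotUA;
use HTTP::Request;

my $ua = LWP::RobotUA->new('W3C-checklink-sketch/0.1',
                           'webmaster@example.org');

# Sequential version: each link blocks until its response arrives.
foreach my $link (@links) {
    # HEAD is enough unless we need the body to check fragment anchors.
    my $method   = $anchors{$link} ? 'GET' : 'HEAD';
    my $response = $ua->request(HTTP::Request->new($method => $link));
    print "$link: ", $response->code, " ", $response->message, "\n";
}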


A simple way to parallelize the work would be to have the "foreach
link" not immediately loop and check the links sequentially, but
instead register() them with the LWP::Parallel::RobotUA for parallel
handling...
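
Something like this, roughly (an untested sketch assuming
LWP::Parallel::RobotUA's register() interface; the variable names and
the %anchors lookup are again just illustrative):

use LWP::Parallel::RobotUA;
use HTTP::Request;

my $pua = LWP::Parallel::RobotUA->new('W3C-checklink-sketch/0.1',
                                      'webmaster@example.org');
$pua->max_hosts(5);   # fetch from up to 5 hosts in parallel
$pua->max_req(2);     # at most 2 simultaneous requests per host

# Instead of checking each link inside the foreach loop, just queue it.
foreach my $link (@links) {
    my $method = $anchors{$link} ? 'GET' : 'HEAD';
    if (my $res = $pua->register(HTTP::Request->new($method => $link))) {
        # register() only returns a response object on failure
        print STDERR $res->error_as_HTML;
    }
}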

LWP::Parallel::RobotUA has two ways of giving back the results:
either through a callback sub that the UA calls for each chunk of
data received, or by just wait()ing until it's done (the wait()
method returns the collected responses once everything has been
fetched).
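
With the wait() route, collecting the results would look roughly like
this (again an untested sketch based on the LWP::Parallel
documentation; the callback route would instead pass a code reference
as the second argument to register()):

# Let the queued requests run; the argument is a per-connection
# timeout in seconds.
my $entries = $pua->wait(30);

foreach my $key (keys %$entries) {
    my $response = $entries->{$key}->response;
    print $response->request->url, ": ",
          $response->code, " ", $response->message, "\n";
}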

The former is (perhaps) more compatible with our output model but  
(perhaps) a bit harder to implement. On the other hand, the wait()
method is easier to implement, but not very good in terms of output  
(all the output for every link of a document would come at once)...  
One idea (beware - rough braindump) would be to store all the links  
in a per-host hash, and process a block of n (n=4?5?) at a time,  
trying not to process several links to a single host within a given  
block.
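
To make that braindump a bit more concrete, the per-host batching
could be sketched like this (purely illustrative and untested; the
round-robin grouping and the block size are assumptions, not existing
code):

use URI;

# Group the links to check by host (assuming absolute http/https URIs).
my %by_host;
foreach my $link (@links) {
    push @{ $by_host{ URI->new($link)->host } }, $link;
}

# Build blocks of up to $n links, taking at most one link per host
# per block (round-robin over the hosts that still have links left).
my $n = 5;
my @blocks;
while (my @hosts = grep { @{ $by_host{$_} } } keys %by_host) {
    my @block;
    foreach my $host (@hosts) {
        push @block, shift @{ $by_host{$host} };
        last if @block == $n;
    }
    push @blocks, \@block;
}

# Each block would then be register()ed and wait()ed on in turn, so
# results come back every few links instead of all at once per document.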

thoughts?
-- 
olivier

Received on Thursday, 30 June 2005 10:20:24 UTC