Re: Making Linkchecker working parallel from olivier Thereaux on 2005-07-20 (public-qa-dev@w3.org from July 2005)

From: olivier Thereaux <ot@w3.org>
Date: Wed, 20 Jul 2005 21:18:30 +0900
To: olivier Thereaux <ot@w3.org>
Cc: Dominique Hazaël-Massieux <dom@w3.org>, public-qa-dev@w3.org
Message-Id: <FFD153CF-9770-4AE8-8ED2-8C7774F8CB97@w3.org>

On 30 Jun 2005, at 19:20, olivier Thereaux wrote:
> LWP::Parallel::RobotUA has two ways of giving back the results,  
> either through a callback sub which the UA calls for each chunk of  
> data received, or just wait() until it's done (the wait() method  
> will return the array of responses).
>
> The former is (perhaps) more compatible with our output model but  
> (perhaps) a bit harder to implement. On the other hand the wait()  
> method is easier to implement, but not very good in terms of output  
> (all the output for every link of a document would come at once)...  
> One idea (beware - rough braindump) would be to store all the links  
> in a per-host hash, and process a block of n (n=4?5?) at a time,  
> trying not to process several links to a single host within a given  
> block.

The documentation for LWP::Parallel::* is a bit scattered, but after
giving it a further look, I found that there is a "middle ground"
between the rather awkward chunk-based callback subroutine and the
batch wait(), which only returns once everything has been fetched.

  http://search.cpan.org/~marclang/ParallelUserAgent-2.57/lib/LWP/Parallel.pm
has:
   # on_return gets called whenever a connection (or its callback)
   # returns EOF (or any other terminating status code available for
   # callback functions). Please note that on_return gets called for
   # any successfully terminated HTTP connection! This does not imply
   # that the response sent from the server is a success!

The example goes on to give a simple print() approach, and it looks
very promising, since we could add to our database of checked links
and output info as soon as each link is fetched. The best thing is
that on_return is meant to be overridden in a subclass of
LWP::Parallel::RobotUA, which is exactly what our W3C::UserAgent
would be.

So instead of

[[

package W3C::UserAgent;

use LWP::RobotUA 1.19 qw();
# @@@ Needs also W3C::LinkChecker but can't use() it here.

@W3C::UserAgent::ISA = qw(LWP::RobotUA);

]]

we could have

[[

package W3C::UserAgent;

#use LWP::RobotUA 1.19 qw();
use Exporter();
use LWP::Parallel::RobotUA qw(:CALLBACK);
# @@@ Needs also W3C::LinkChecker but can't use() it here.

@W3C::UserAgent::ISA = qw(LWP::Parallel::RobotUA Exporter);
@W3C::UserAgent::EXPORT = @LWP::Parallel::RobotUA::EXPORT_OK;

]]
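
With that in place, an on_return override could be sketched roughly
as below. This is untested and only meant to show the shape: the
package header repeats the lines proposed above, the argument list
follows the on_return example in the LWP::Parallel documentation, and
%checked is a made-up stand-in for our real database of checked
links.

```perl
package W3C::UserAgent;

use strict;
use warnings;
use Exporter ();
use LWP::Parallel::RobotUA qw(:CALLBACK);

our @ISA    = qw(LWP::Parallel::RobotUA Exporter);
our @EXPORT = @LWP::Parallel::RobotUA::EXPORT_OK;

# Hypothetical store of results, keyed by URI (not the real checklink
# data structure).
our %checked;

# Called once per terminated connection. "Terminated" does not mean
# "successful": the HTTP status still has to be inspected.
sub on_return {
    my ($self, $request, $response, $entry) = @_;

    my $uri = $request->uri;
    $checked{$uri} = {
        code    => $response->code,
        message => $response->message,
    };

    # Report right away instead of waiting for the whole batch.
    printf "%s -> %s %s\n", $uri, $response->code, $response->message;
    return;
}

1;
```

The point is that each link is recorded and reported as soon as its
connection ends, which is what our output model wants.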

One other cool thing about this parallel RobotUA is that we can set
the number of hosts crawled in parallel, and the number of documents
crawled in parallel for each host.

I wrote earlier:

> In particular I like these two options:
> $ua->max_hosts ( $max )
>     Changes the maximum number of locations accessed in parallel.  
> The default value is 7.
>
> $ua->max_req ( $max )
>   Changes the maximum number of requests issued per host in  
> parallel. The default value is 5.
>
> I think this means we could greatly improve the speed of the link  
> checker by setting the latter to 1, and the former to... something  
> reasonably high.

But actually, we could set $ua->max_req to, say, 2 or 3 and still  
have a robotUA that "behaves" nicely with each server... For a  
document such as the W3C homepage, quite heavy on intra-linking (130
links to www.w3.org out of 160 links last time I checked), a check
that currently takes ~200 seconds could take as little as 40-60
seconds, which is much closer to our goal.
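
For the record, setting those two limits would look something like
the fragment below. It is only a sketch: the agent name, the contact
address and the value 10 for max_hosts are placeholders I made up,
and only max_req(2) reflects the "2 or 3" discussed above.

```perl
use strict;
use warnings;
use LWP::Parallel::RobotUA;

# Robot name and contact address are placeholders.
my $ua = LWP::Parallel::RobotUA->new('linkcheck-sketch/0.1',
                                     'webmaster@example.org');

$ua->max_hosts(10);  # distinct hosts contacted in parallel (illustrative)
$ua->max_req(2);     # simultaneous requests allowed per host
```

Requests would then be queued with register() and collected either
through wait() or through the on_return hook.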

-- 
olivier

Received on Wednesday, 20 July 2005 12:18:30 UTC