Re: [checklink] list of potential solutions to timing issues

On Mar 11, 2005, at 16:54, Ville Skyttä wrote:
>> Cat 1: make the link checker faster
>> * RobotUA has a minimum latency of 1s between requests, so we can't
>> make one W3C::UserAgent instance be much faster, but we could use
>> several.
>
> Right, assuming we can get those UserAgents to crawl in parallel.  But
> that would require some possibly nontrivial changes to the link 
> checker.

Agreed. Still, the more I think of it, the more I am convinced that 
this is our best bet.

What we have now in sub check_uri is basically
  # check document
  # Record all the links found
  foreach my $u (keys %links) {
    ...
    &check_validity();
  }
  ...

I suppose that before the foreach loop we could classify the links by 
host (or by IP, although the extra DNS queries might slow us down) and 
spawn n UAs, one per host group, to check the validity of the links in 
each group; something along the lines of the sketch below.
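
Untested sketch, just to give an idea. It assumes 
Parallel::ForkManager (which is not a checklink dependency today) and 
a hypothetical check_host_links() wrapping the current per-link 
validity loop:

  use URI;
  use Parallel::ForkManager;   # assumption: we'd accept this dependency

  # group the links found in the document by host
  my %by_host;
  foreach my $u (keys %links) {
    my $host = eval { URI->new($u)->host } || '';  # mailto: etc. have no host
    push @{ $by_host{$host} }, $u;
  }

  my $pm = Parallel::ForkManager->new(5);  # say, at most 5 UAs at once
  foreach my $host (keys %by_host) {
    $pm->start and next;                   # one child process per host
    my $ua = W3C::UserAgent->new();        # however we construct the UA today
    check_host_links($ua, @{ $by_host{$host} });  # hypothetical helper
    $pm->finish;
  }
  $pm->wait_all_children;

The obvious catch with forking is getting the children's results back 
into %processed and %results in the parent; threads or an event loop 
would avoid that, at the cost of other complications.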

> For example, not too long ago, I specifically made sure that we use 
> only
> one UA; before that, IIRC WLC was fetching /robots.txt for every single
> link due to robots info not being shared between different UAs spawned
> here and there.

Are we using / could we use a global cache of fetched robots rules, 
alongside %processed, %results and %redirects? Something like the 
sketch below, perhaps.
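
If W3C::UserAgent lets us hand a rules object down to LWP::RobotUA 
(which accepts a WWW::RobotRules object as an optional third 
constructor argument), something like this might do; $AgentString and 
$From stand for whatever we pass to the UA constructor now:

  use WWW::RobotRules;

  # global, alongside %processed, %results and %redirects
  our $RobotRules = WWW::RobotRules->new($AgentString);

  # every UA we spawn shares the same rules object, so robots.txt is
  # fetched and parsed at most once per host
  my $ua = LWP::RobotUA->new($AgentString, $From, $RobotRules);

If we end up forking one process per host as sketched above, an 
in-memory object obviously won't be shared, and something like 
WWW::RobotRules::AnyDBM_File could back the cache with a file; on the 
other hand, with one UA per host, each robots.txt would only be 
fetched once anyway.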

>  Also, the report on progress needs to be revamped if
> there are multiple UAs working simultaneously.

Not much, if I am not mistaken. As far as I can tell, the "starting to 
process URI foo" and "URI foo processed in X time - result bar" 
messages are already output separately, so they could simply 
interleave if several UAs are working at once.

> I have a feeling that it's not feasible to implement this in the 
> current
> codebase before m12n.

I would say it's probably feasible, but it makes more sense to do it 
after, or during, m12n. I am mostly trying to come up with a solution 
we could aim to implement whenever we are ready...

-- 
olivier
