Re: [checklink] list of potential solutions to timing issues

On Fri, 2005-03-11 at 16:06 +0900, Olivier Thereaux wrote:

A quick reply to the first point (I'm in a hurry now), more next week.

> Cat 1: make the link checker faster
> * RobotUA has a minimum latency of 1s between requests, so we can't 
> make one W3C::UserAgent instance be much faster, but we could use 
> several.

Right, assuming we can get those UserAgents to crawl in parallel.  But
that would require some possibly nontrivial changes to the link checker.

For example, not too long ago, I specifically made sure that we use only
one UA; before that, IIRC WLC was fetching /robots.txt for every single
link due to robots info not being shared between different UAs spawned
here and there.  Also, progress reporting would need to be revamped if
there are multiple UAs working simultaneously.

I have a feeling that it's not feasible to implement this in the current
codebase before m12n.
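For reference, roughly what the sharing would have to look like, assuming
W3C::UserAgent keeps its LWP::RobotUA ancestry (the agent name, admin
address, and agent count below are placeholders, not what WLC actually
uses):

```perl
use LWP::RobotUA;
use WWW::RobotRules;

# One shared rules object keeps robots.txt results common to every
# agent, so each /robots.txt is fetched at most once no matter how
# many UAs run in parallel.
my $rules = WWW::RobotRules->new('W3C-checklink');

my @agents = map {
    LWP::RobotUA->new(
        agent => 'W3C-checklink',
        from  => 'admin@example.org',   # placeholder admin address
        rules => $rules,                # shared between all instances
    )
} 1 .. 4;                               # arbitrary number of agents

# Each agent still honors the per-host delay independently;
# coordinating the delay *across* agents would need extra bookkeeping.
$_->delay(1/60) for @agents;            # 1 second; delay() is in minutes
```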

> I assume this is what the following comment means:
> [[
> my $ua = W3C::UserAgent->new($AGENT); # @@@ TODO: admin address
> # @@@ make number of keep-alive connections customizable
> ]]

Not really.  A single UA can have a number of kept-alive connections;
that's what this comment is about.  I remember thinking about sorting
the list of known to-be-checked URLs (after canonicalization) and
processing them in that order for better keep-alive utilization, but I
don't remember if I did anything about that.
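Sketched with plain LWP, the idea would be something like this (the URL
list and connection count are made up for illustration):

```perl
use LWP::UserAgent;
use URI;

# keep_alive => N makes LWP keep up to N persistent connections in an
# LWP::ConnCache; the "4" stands in for the customizable number from
# the TODO comment.
my $ua = LWP::UserAgent->new(keep_alive => 4);

my @urls = qw(
    http://www.w3.org/a
    http://example.com/x
    http://www.w3.org/b
);

# Canonicalize, then sort so same-host URLs are adjacent and the
# cached connection to a host gets reused instead of evicted.
my @sorted = sort {
    URI->new($a)->canonical->host cmp URI->new($b)->canonical->host
} @urls;
```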

> Having a configurable number could make sense. We could also spawn one 
> W3C::UserAgent per target host (would require changes in how and when 
> the parsing of the links is done, I suppose?)

Probably right.  See also above for other things involving sharing stuff
between the agents.
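A hypothetical sketch of the one-agent-per-host idea, with the robots
rules shared so /robots.txt still isn't re-fetched per agent (names and
the admin address are placeholders):

```perl
use LWP::RobotUA;
use WWW::RobotRules;

my $rules = WWW::RobotRules->new('W3C-checklink');  # shared robots cache
my %ua_for_host;

# Create one agent per target host, lazily, on first use.
sub agent_for_host {
    my ($host) = @_;
    return $ua_for_host{$host} ||= LWP::RobotUA->new(
        agent => 'W3C-checklink',
        from  => 'admin@example.org',   # placeholder admin address
        rules => $rules,
    );
}
```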

I'll look through this list early next week when I have more time.

Received on Friday, 11 March 2005 07:54:58 UTC