- From: olivier Thereaux <ot@w3.org>
- Date: Thu, 30 Jun 2005 19:20:33 +0900
- To: olivier Thereaux <ot@w3.org>
- Cc: Dominique Hazaël-Massieux <dom@w3.org>, public-qa-dev@w3.org
On 27 Jun 2005, at 13:30, olivier Thereaux wrote:

> On 22 Jun 2005, at 22:56, Dominique Hazaël-Massieux wrote:
>
>> I've had a quick look at the linkchecker to see what would be needed to
>> make it multithreaded; I see the linkchecker is using LWP::RobotUA. Has
>> any thought been put into using LWP::Parallel::RobotUA [1] instead?
>
> This is a really good idea, thanks! As Ville said, some such ideas
> have been thrown around with the same goal, but our best bet so far
> was to try and have parallel RobotUA instances, which would have
> been problematic in many ways.
>
> This looks promising, as it would certainly remove some of the
> implementation concerns. Instead of having to track everything
> ourselves, it seems that this LWP::Parallel::RobotUA can be given
> new documents to process at any time (by 'registering' new
> requests); you then wait for some time and fetch the results.

I had a further look at it today. The changes to make it work are minimal;
the changes to make it work efficiently are, I think, small.

At the moment, checklink proceeds this way:

    GET the initial document                              --> sub get_document --> sub get_uri
    parse the document, extract links and anchors
    foreach link {                                        --> sub check_validity
        HEAD (or GET if there are anchors at play) link   --> sub get_uri
        display result; store;
    }
    proceed to next document

A simple way to parallelize the work would be to have the "foreach link"
not immediately loop and check the links sequentially, but instead
register() them into the LWP::Parallel::RobotUA for parallel handling...

LWP::Parallel::RobotUA has two ways of giving back the results: either
through a callback sub which the UA calls for each chunk of data received,
or you just wait() until it's done (the wait() method will return the array
of responses). The former is (perhaps) more compatible with our output
model but (perhaps) a bit harder to implement. On the other hand, the
wait() method is easier to implement, but not very good in terms of output
(all the output for every link of a document would come at once)...

One idea (beware - rough braindump) would be to store all the links in a
per-host hash, and process a block of n (n=4? 5?) at a time, trying not to
process several links to a single host within a given block.

Thoughts?

-- 
olivier
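
For illustration, here is a minimal sketch of the wait()-based approach
described above, using the documented LWP::Parallel::RobotUA interface
(register(), wait(), max_hosts(), max_req()). This is not checklink code:
the agent name, email, link list and the process_result() sub are made up
for the example.

    use strict;
    use warnings;
    use HTTP::Request;
    use LWP::Parallel::RobotUA;

    # Hypothetical agent name and contact address, for illustration only.
    my $ua = LWP::Parallel::RobotUA->new('checklink-sketch/0.1', 'webmaster@example.org');
    $ua->delay(1/60);     # polite delay per host, in minutes (here: 1 second)
    $ua->max_hosts(5);    # how many hosts to contact in parallel
    $ua->max_req(2);      # parallel requests per host

    # Instead of checking each link sequentially in the foreach loop,
    # register them all, then let the UA fetch them in parallel.
    my @links = ('http://www.w3.org/', 'http://www.w3.org/QA/');
    foreach my $uri (@links) {
        # HEAD is enough when we don't need to look for anchors in the target.
        $ua->register(HTTP::Request->new(HEAD => $uri));
    }

    # wait() blocks until everything is done (or the timeout expires) and
    # returns the collected entries; this is the "all output at once" model.
    my $entries = $ua->wait(30);
    foreach my $key (keys %$entries) {
        my $response = $entries->{$key}->response;
        process_result($response->request->url, $response->code);
    }

    # Stand-in for checklink's "display result; store" step.
    sub process_result {
        my ($uri, $code) = @_;
        print "$code $uri\n";
    }

The callback variant would instead pass a sub reference and a chunk size to
register(), which maps more naturally onto the incremental, per-link output
model discussed above, at the cost of more bookkeeping in the callback.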
Received on Thursday, 30 June 2005 10:20:24 UTC