Re: Making Linkchecker work multithreaded

On Wed, 2005-06-22 at 15:56 +0200, Dominique Hazaël-Massieux wrote:
> Hi QA-dev, Ville,

Hi,

> I've had a quick look at the linkchecker to see what would be needed to
> make it multithreaded; I see the linkchecker is using LWP::RobotUA. Has
> any thought been put into using LWP::Parallel::RobotUA [1] instead?

Some thoughts about parallelizing the fetches have been tossed around on
this list and in past meetings, but AFAIK nothing concrete has really
happened yet.  My feeling is that it has generally been seen as a good
thing.

>  It's a
> derivative of LWP::Parallel::UserAgent [2] which should allow per-host
> parallel fetching using register/callback functions. There are several
> examples documented [3], esp. one using the RobotUA class.

I'm vaguely aware of this, and should really get more familiar with it
sometime soon.  Haven't found round tuits to do it so far, though.
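
For the record, the register/wait pattern those docs describe looks
roughly like this.  This is an untested sketch based on my reading of
the LWP::Parallel documentation, not checklink code; the agent name,
contact address, URLs and limits below are placeholders:

  use strict;
  use warnings;
  use HTTP::Request;
  use LWP::Parallel::RobotUA;

  my $ua = LWP::Parallel::RobotUA->new('checklink-test/0.1',
                                       'webmaster@example.org');
  $ua->delay(1/60);    # min. delay per host, in minutes (here: 1 second)
  $ua->max_hosts(5);   # fetch from at most 5 hosts in parallel
  $ua->max_req(2);     # at most 2 simultaneous requests per host

  foreach my $url ('http://example.org/', 'http://example.com/foo') {
      # register() only queues the request; the callback is invoked for
      # each chunk of data received, so it should collect, not print.
      $ua->register(HTTP::Request->new(GET => $url), \&on_data);
  }

  # wait() does the actual parallel fetching and returns the entries.
  my $entries = $ua->wait(30);
  for my $entry (values %$entries) {
      my $res = $entry->response;
      print $res->request->url, ": ", $res->code, " ", $res->message, "\n";
  }

  sub on_data {
      my ($data, $response, $protocol, $entry) = @_;
      return length($data);   # non-zero return keeps the connection going
  }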

> I haven't investigated much how this would apply to the linkchecker; I
> guess my first question is whether this has already been evaluated and
> discarded or not. (I haven't found anything with the mailing list search
> engine, but it may have been discussed in other fora.)

As said, I think it has been discussed somewhat here or in the meetings,
but I don't have any pointers to throw in right now.

Anyway, the idea has certainly not been discarded.  Personally, my only
concern about parallelizing the link checker is that it might cause some
complications or restrictions on how to present the results to the user.
I think we all agree that the results UI needs some work anyway, and it
isn't quite clear what The Way to implement it would be; the most
serious problem currently is the timeout issues (on either the server or
the client side).

Assuming we keep the results output relatively close to what it is now,
we would need to either synchronize the output stream to the client at
the callback level (OTOH, callbacks shouldn't really print anything to
the stream themselves IMO), or implement an event sink of some kind that
takes care of the output stream while the checking proceeds, or buffer
the results more than we do now and present them in bigger chunks (which
could make the timeout problems even worse than they are now).
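
To illustrate the last two options: with LWP::Parallel one could
override on_return(), which gets called whenever a request completes,
and make it the single sink that buffers results, with one routine
owning the output stream.  Again just a rough, untested sketch; the
package name, flush_results() and the chunk size of 10 are made up, and
real code would emit the HTML results rather than plain lines:

  package W3C::Checklink::BufferingUA;   # name invented for this sketch
  use strict;
  use warnings;
  use base 'LWP::Parallel::RobotUA';

  my @buffer;   # results not yet sent to the client

  # Called by LWP::Parallel for every completed request; acts as the
  # event sink, so nothing inside the fetching machinery prints.
  sub on_return {
      my ($self, $request, $response, $entry) = @_;
      push @buffer, [$request->url->as_string, $response->code];
      $self->flush_results if @buffer >= 10;   # arbitrary chunk size
      return;
  }

  # The only place that touches the output stream.
  sub flush_results {
      printf "%s: %d\n", @$_ for @buffer;   # real code: HTML results table
      @buffer = ();
  }

As far as I understand, LWP::Parallel multiplexes the connections with
select() in a single process rather than real threads, so a plain buffer
like this shouldn't even need locking; the open question is more about
how and when to flush it to the client (at least once more after wait()
returns).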

These are more random than refined thoughts, and it might well turn out
that the concerns go away as soon as someone just starts to experiment.
But currently I tend to think that before starting serious work on the
parallelization, we should first decide how we would like to present the
results to the user in the future.

Received on Friday, 24 June 2005 12:04:47 UTC