Re: Status of parallel link checker? from olivier Thereaux on 2006-04-03 (public-qa-dev@w3.org from April 2006)

From: olivier Thereaux <ot@w3.org>
Date: Mon, 3 Apr 2006 14:55:41 +0900
To: QA Dev <public-qa-dev@w3.org>
Message-Id: <8F74B3F1-2B08-4442-893F-14CEA5871D27@w3.org>

Ville,

On 3 Apr 2006, at 04:13, Ville Skyttä wrote:
> I took (finally) a brief look at CVS HEAD of link checker, and it is
> probably expectedly pretty broken at the moment.

I haven't made much progress since my comment on version 4.22 of the
checklink script, which said "WARNING: this code is rather broken...",
so yes, I suspect things are still pretty broken. I recall that it did
run faster, and I think it performed the basic tasks, but there were
issues. I didn't pursue the experiment further, for reasons detailed
below.

> I'm having some concerns about ParallelUserAgent not only because  
> of the
> missing included request inside responses [0]

Yes, I ran into that a couple of times, hence v4.24 of
http://dev.w3.org/cvsweb/perl/modules/W3C/LinkChecker/bin/checklink
Dom submitted one patch to Marc, and he said he'd give it a look, but  
he also kept his promise that "it wouldn't be quick" :/

> but also because it does
> not actually sleep between requests to a host but does something weird
> instead

Ouch, you mean ParallelUserAgent does that? Or is it something that
the current link checker code does wrong in this regard?

> So, what's the general status of the parallel link checking stuff,  
> is someone subconsciously or otherwise working on it?

Not working on it at the moment... Basically, I often find myself
regretting having accepted the request to have the link checker
follow robots.txt rules. Not only did it make the tool awfully slow
(I only ever use it from the command line these days; using it from a
browser pains me), in many cases it also means that links have to be
checked by hand. Hence my relative cold feet about ParallelRobotUA,
or any RobotUA solution in general.
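
To illustrate why following robots.txt leaves some links to be checked by hand, here is a minimal sketch in Python (not the Perl code of checklink itself). It uses the standard library's urllib.robotparser; the robots.txt content, the user-agent name and the classify_link helper are all hypothetical and only serve the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from a string so that the
# example needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 1
"""

AGENT = "example-link-checker"  # hypothetical user-agent name


def make_parser(robots_txt):
    """Build a parser from robots.txt text instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def classify_link(parser, agent, url):
    """Return 'check' if the robot may fetch url, else 'manual'."""
    return "check" if parser.can_fetch(agent, url) else "manual"


parser = make_parser(ROBOTS_TXT)
public = classify_link(parser, AGENT, "http://example.org/public/page.html")
private = classify_link(parser, AGENT, "http://example.org/private/page.html")
delay = parser.crawl_delay(AGENT)  # requested seconds between requests
```

Under these assumptions, any link below /private/ ends up in the "manual" bucket, and the Crawl-delay value is what slows down a polite robot.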

If a browser-based widget (either Ajax or a proprietary browser plugin)
were to do link checking today, I don't really expect that there
would be protests demanding that it follow robots.txt. Avoid slamming
remote servers, probably, but respect Disallow: etc., probably not.
The more I think about this, the more I look at your "ack" [1] with
interest, and think it could/should be the replacement for the
web-based link checker (while still distributing the older one as a
Perl module/command-line tool).
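
The "avoid slamming remote servers" part can be handled without any robots.txt logic. The following is only a sketch in Python, under the assumption that a fixed minimum delay per host is enough; the HostThrottle class and its parameters are hypothetical and are not taken from checklink or ParallelUserAgent.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_delay, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock    # injectable for testing
        self._sleep = sleep    # injectable for testing
        self._last = {}        # host -> time of the last request

    def wait(self, url):
        """Sleep if needed before requesting url; return seconds slept."""
        host = urlsplit(url).netloc.lower()
        now = self._clock()
        waited = 0.0
        last = self._last.get(host)
        if last is not None:
            remaining = self.min_delay - (now - last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
                now = self._clock()
        self._last[host] = now
        return waited
```

A checker would call wait(url) just before each request: links on different hosts go out immediately, while repeated requests to one host are spaced at least min_delay seconds apart.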

-- 
olivier

Received on Monday, 3 April 2006 05:55:52 UTC