Re: checklink inside a firewall

On Tuesday 24 July 2007, CLOSE Dave wrote:
> I'm trying to run checklink against an internal web site with several
> thousand pages. Every page includes a pair of links to outside URLs that
> cannot be accessed except through the company proxy. However, if the
> proxy is enabled, the internal pages cannot be accessed. So, when I run
> checklink with the proxy disabled, I get two errors for every page (and
> the process takes much longer than it should).

Have you tried setting the no_proxy environment variable?  It takes a
comma-separated list of domains for which the proxy should not be used.
Something like this in the environment could work:

http_proxy=http://your.proxy.server/
https_proxy=http://your.proxy.server/
ftp_proxy=http://your.proxy.server/
no_proxy=your.intranet.domain

See the LWP::UserAgent documentation for env_proxy() for more information.
http://search.cpan.org/dist/libwww-perl/lib/LWP/UserAgent.pm#%24ua-%3Eenv_proxy
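For a Bourne-style shell, the settings above could be exported along these
lines before starting the checker.  This is only a sketch: the proxy URL and
intranet domain are the same placeholders as above, and the commented-out
checklink call at the end is just an illustration of where the run would go.

```shell
# Placeholder values from this message; substitute the real proxy and domain.
export http_proxy=http://your.proxy.server/
export https_proxy=http://your.proxy.server/
export ftp_proxy=http://your.proxy.server/
export no_proxy=your.intranet.domain

# Show the proxy-related variables a child process would inherit.
proxy_env=$(env | grep -i '_proxy=' | sort)
printf '%s\n' "$proxy_env"

# Then run the link checker from this same shell, for example:
# checklink http://your.intranet.domain/
```

The variables need to be exported (not merely assigned) so that they are
visible to programs started from the shell.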

> I'm looking for a way to specify that some links should not be checked.

That has been implemented in the CVS version of the link checker and will be
in the next release.  The option is called --exclude.  The CVS version is
available at http://dev.w3.org/cvsweb/perl/modules/W3C/LinkChecker/

> A second issue arises if I try to parse the output of the validator with
> all these extraneous errors. The errors themselves are reported on
> separate lines from the link and page which caused them.

There's a related RFE that would require that too, filed as
http://www.w3.org/Bugs/Public/show_bug.cgi?id=382 ; there is no ETA for
implementation at the moment.

> Using grep to 
> find the errors doesn't reveal the source of the problems.

grep's -A and -B options, which print lines of context after and before each
match, could help a bit with that.
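As a rough illustration of the context options, the sketch below greps a
made-up report.  The sample lines are hypothetical and are not the link
checker's real output layout; the number of context lines would need adjusting
to whatever the actual output looks like.

```shell
# Hypothetical sample report; the real checklink output layout may differ.
report=$(mktemp)
cat > "$report" <<'EOF'
Processing http://intranet.example/page1.html
http://outside.example/a
  Line: 12
  Code: 500
Processing http://intranet.example/page2.html
http://intranet.example/missing.html
  Line: 40
  Code: 404
EOF

# -B 3 prints the three lines before each match, which in this sample
# include the broken link and the page it was found on.
context=$(grep -B 3 'Code: 404' "$report")
printf '%s\n' "$context"
rm -f "$report"
```

Using -A N in the same way prints N lines after each match, and the two can be
combined when the interesting lines sit on both sides of the error.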

Received on Tuesday, 24 July 2007 15:45:00 UTC