checklink inside a firewall

I'm trying to run checklink against an internal web site with several
thousand pages. Every page includes a pair of links to outside URLs that
cannot be accessed except through the company proxy. However, if the
proxy is enabled, the internal pages cannot be accessed. So, when I run
checklink with the proxy disabled, I get two errors for every page (and
the process takes much longer than it should).

I'm looking for a way to specify that some links should not be checked.
Superficially, --exclude-docs would seem to do the job, but no; it would
prevent checking the content of the subordinate page (if I could get to
it), but does nothing to prevent checking links TO the page.
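
One thing I haven't been able to try yet: the documentation for more
recent checklink releases appears to describe an --exclude option,
distinct from --exclude-docs, which skips the checking of any link
whose canonical URI matches a regexp. If your version supports it,
something like the sketch below might work (the host names here are
placeholders, not our real addresses):

```shell
# Sketch only: assumes a checklink version that supports --exclude
# (which, unlike --exclude-docs, prevents matching links from being
# checked at all).  Host names are placeholders.
checklink --recursive \
          --exclude 'http://www\.external\.example\.com/' \
          http://intranet.example.com/
```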

Since the W3C link checker seems to be the gold standard, and no
alternatives are suggested by the top hundred or so results of a
Google search, I hope I've just missed a technique. Any suggestions?

A second issue arises if I try to parse the link checker's output with
all these extraneous errors present. The errors themselves are reported
on separate lines from the link and page that caused them, so using
grep to find the errors doesn't reveal the source of the problems. Of
course, I could write a program or script to handle this, but anything
that preserves context is likely to slow down scanning - and my output
ran to more than 500 MB! I'd sure like to see output similar to
Apache's log files, with all related information on a single line. I
can do the pretty formatting myself.
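
The reformatting itself should be cheap to script. Here is a minimal
awk sketch; the three patterns in it are assumptions about the report
format (each document's section opening with a "Processing" line,
offending links on lines beginning with "http", error details on a
line containing "Code:") and will almost certainly need tuning against
real checklink output:

```shell
# Emit one tab-separated record per error: page, link, error detail.
# The three patterns are assumptions about checklink's report format
# and will likely need adjusting.  The here-document is sample input;
# in real use, replace it with:  awk '...' checklink-output.txt
awk '
    /^Processing/ { page = $2 }                      # current document
    /^http/       { link = $1 }                      # last link seen
    /Code:/       { print page "\t" link "\t" $0 }   # emit one record
' <<'EOF'
Processing http://internal/page1
http://external/a
Line: 3 Code: 404 Not Found
EOF
```

Since this keeps only the current page and link in memory, it should
stream through even a 500 MB report without trouble.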
-- 
Dave Close, Thales Avionics, Irvine California USA
Software engineering manager, Hardware engineering department
cell +1 949 394 2124, dave.close@us.thalesgroup.com

Received on Tuesday, 24 July 2007 09:29:52 UTC