- From: Michael Ernst <mernst@alum.mit.edu>
- Date: Thu, 25 Sep 2008 10:58:30 +0200
- To: www-validator@w3.org
- Message-ID: <18651.21174.178232.425269@swsmde.ds.mpi-sws.mpg.de>
Checklink goes into an infinite loop when run with the -r switch.
To reproduce the problem, unpack checklink-infinite-regress-testcase.tgz
somewhere and then run
checklink -D 2 -e ${BASE_URL}/checklink-infinite-regress-testcase
If you change "-D 2" to "-r", then you get an infinite loop.
The problem comes from relative URLs that contain multiple slashes where
only one should appear. Such HTML is produced, for example, by Javadoc.
You can see an example at
http://groups.csail.mit.edu/pag/daikon/download/jdoc/binary_variables/package-tree.html
where the two links with anchor text "PREV" are:
<a href="..//package-tree.html"><b>PREV</b></a>
Such a link works fine in my browser (Firefox), so checklink shouldn't
loop forever on it.
I have attached a patch that corrects the problem.
-Michael Ernst
Index: checklink
===================================================================
RCS file: /sources/public/perl/modules/W3C/LinkChecker/bin/checklink,v
retrieving revision 4.116
diff -u -u -b -r4.116 checklink
--- checklink 22 Sep 2008 19:33:31 -0000 4.116
+++ checklink 25 Sep 2008 08:41:02 -0000
@@ -37,7 +37,7 @@
use LWP::UserAgent qw();
# if 0, ignore robots exclusion (useful for testing)
-use constant USE_ROBOT_UA => 1;
+use constant USE_ROBOT_UA => 0;
if (USE_ROBOT_UA) {
@W3C::UserAgent::ISA = qw(LWP::RobotUA);
@@ -962,6 +962,10 @@
# Record all the links found
while (my ($link, $lines) = each(%{$p->{Links}})) {
my $link_uri = URI->new($link);
+ # Remove repeated slashes, to avoid duplicated checking or infinite
+ # recursion. Don't match the double slashes in "http://", however.
+ $link_uri =~ s|([^:])//+|$1/|g;
+
my $abs_link_uri = URI->new_abs($link_uri, $base);
if ($Opts{Masquerade}) {
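
To see what the substitution in the patch does outside of checklink, here
is a minimal standalone Perl sketch. The helper name collapse_slashes and
the example.org URLs are made up for illustration; only the regular
expression is taken from the patch.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of checklink): applies the same
# substitution as the patch to a plain string and returns the result.
sub collapse_slashes {
    my ($link) = @_;
    # Collapse any run of two or more slashes into one, unless the run
    # is directly preceded by a colon (as in "http://").
    $link =~ s|([^:])//+|$1/|g;
    return $link;
}

# Sample links; the hosts and paths are illustrative assumptions.
my @samples = (
    '..//package-tree.html',
    'http://example.org/jdoc//binary_variables///package-tree.html',
    'http://example.org/jdoc/index.html',
    'file:///tmp/jdoc/index.html',
);

for my $link (@samples) {
    print "$link\n    => ", collapse_slashes($link), "\n";
}
```

The first two samples come out with single slashes in the path, and the
third is unchanged because its only double slash follows the colon. The
last sample shows a caveat: a URL with an empty authority, such as
file:///tmp/..., loses a slash too, because the character before the
extra slashes is itself a slash rather than a colon. That probably does
not matter for links checked over HTTP, but it is worth knowing.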
Attachments
- application/octet-stream attachment: checklink-infinite-regress-testcase.tgz
Received on Thursday, 25 September 2008 08:59:13 UTC