
checklink: infinite loop

From: Michael Ernst <mernst@alum.mit.edu>
Date: Thu, 25 Sep 2008 10:58:30 +0200
Message-ID: <18651.21174.178232.425269@swsmde.ds.mpi-sws.mpg.de>
To: www-validator@w3.org

Checklink goes into an infinite loop when run with the -r switch.

To reproduce the problem, unpack checklink-infinite-regress-testcase.tgz
somewhere and then run the following, where ${BASE_URL} is the URL of the
directory in which you unpacked it:

  checklink -D 2 -e ${BASE_URL}/checklink-infinite-regress-testcase

If you change "-D 2" to "-r", then you get an infinite loop.

The problem comes from relative URLs that contain multiple slashes where
only one should appear.  Such HTML is produced, for example, by Javadoc.
You can see an example at

  http://groups.csail.mit.edu/pag/daikon/download/jdoc/binary_variables/package-tree.html

where both links with anchor text "PREV" are:

  <a href="..//package-tree.html"><b>PREV</b></a>

Such a link works fine in my browser (Firefox), so checklink should handle
it without looping forever.
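
One plausible way such a link leads to unbounded recursion can be sketched
as follows. This is a Python illustration, not checklink code. It assumes
that relative references are resolved in the style of RFC 3986 with empty
path segments preserved, and that the page reached through "PREV" links
back to the package page with a link such as
"binary_variables/package-tree.html". That link, the example.org host, and
the function names are all hypothetical.

```python
from urllib.parse import urlsplit


def remove_dot_segments(path: str) -> str:
    """Resolve "." and ".." segments while keeping empty segments intact."""
    segments = path.split("/")
    output = []
    for segment in segments[1:]:  # skip the empty string before the leading "/"
        if segment == ".":
            continue
        if segment == "..":
            if output:
                output.pop()  # drop the previous segment, even an empty one
            continue
        output.append(segment)
    # A trailing "." or ".." names a directory, so keep the trailing slash.
    if segments[-1] in (".", ".."):
        output.append("")
    return "/" + "/".join(output)


def resolve(base: str, ref: str) -> str:
    """Resolve a relative path reference against an absolute base URL."""
    parts = urlsplit(base)
    base_dir = parts.path[: parts.path.rfind("/") + 1] or "/"
    return f"{parts.scheme}://{parts.netloc}{remove_dot_segments(base_dir + ref)}"


def follow_prev_links(start: str, rounds: int) -> list:
    """Follow "PREV", then a hypothetical link back, for several rounds."""
    url = start
    visited = []
    for _ in range(rounds):
        url = resolve(url, "..//package-tree.html")  # the link from the report
        visited.append(url)
        # Hypothetical link from the overview page back to the package page.
        url = resolve(url, "binary_variables/package-tree.html")
    return visited


if __name__ == "__main__":
    start = "http://example.org/jdoc/binary_variables/package-tree.html"
    for url in follow_prev_links(start, 3):
        print(url)
```

Under these assumptions each round yields a URL with one more slash than
the previous one, so a checker that compares URLs as strings never sees a
page it has already visited.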

I have attached a patch that corrects the problem.

                    -Michael Ernst


Index: checklink
===================================================================
RCS file: /sources/public/perl/modules/W3C/LinkChecker/bin/checklink,v
retrieving revision 4.116
diff -u -u -b -r4.116 checklink
--- checklink	22 Sep 2008 19:33:31 -0000	4.116
+++ checklink	25 Sep 2008 08:41:02 -0000
@@ -37,7 +37,7 @@
 use LWP::UserAgent      qw();
 
 # if 0, ignore robots exclusion (useful for testing)
-use constant USE_ROBOT_UA => 1;
+use constant USE_ROBOT_UA => 0;
 
 if (USE_ROBOT_UA) {
   @W3C::UserAgent::ISA = qw(LWP::RobotUA);
@@ -962,6 +962,10 @@
   # Record all the links found
   while (my ($link, $lines) = each(%{$p->{Links}})) {
     my $link_uri = URI->new($link);
+    # Remove repeated slashes, to avoid duplicated checking or infinite
+    # recursion.  Don't match the double slashes in "http://", however.
+    $link_uri =~ s|([^:])//+|$1/|g;
+
     my $abs_link_uri = URI->new_abs($link_uri, $base);
 
     if ($Opts{Masquerade}) {
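

The effect of the substitution in the patch can be tried in isolation.
The sketch below is a Python transliteration of the Perl expression
s|([^:])//+|$1/|g. The function name and the sample links other than
"..//package-tree.html" are made up for illustration.

```python
import re


def collapse_slashes(link: str) -> str:
    """Replace runs of slashes with one slash unless a colon precedes them."""
    # Same pattern as the patch: one non-colon character followed by two or
    # more slashes is rewritten to that character plus a single slash.
    return re.sub(r"([^:])//+", r"\1/", link)


if __name__ == "__main__":
    samples = [
        "..//package-tree.html",             # the link from the report
        "http://example.org/a//b///c.html",  # "http://" must stay intact
        "//example.org/a.html",              # leading "//" has nothing before it
        "search?path=a//b",                  # the query string is rewritten too
    ]
    for link in samples:
        print(link, "->", collapse_slashes(link))
```

Because the pattern runs over the whole link text, it also rewrites
repeated slashes inside a query string or fragment, which may or may not
be what a given site expects.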


Received on Thursday, 25 September 2008 08:59:13 GMT
