- From: Michael Ernst <mernst@alum.mit.edu>
- Date: Thu, 25 Sep 2008 10:58:30 +0200
- To: www-validator@w3.org
- Message-ID: <18651.21174.178232.425269@swsmde.ds.mpi-sws.mpg.de>
Checklink suffers an infinite loop when run with the -r switch.

To reproduce the problem, unpack checklink-infinite-regress-testcase.tgz somewhere and then run

  checklink -D 2 -e ${BASE_URL}/checklink-infinite-regress-testcase

If you change "-D 2" to "-r", then you get an infinite loop.

The problem comes from relative URLs that contain multiple slashes where only one should appear. Such HTML is produced, for example, by Javadoc. You can see an example at

  http://groups.csail.mit.edu/pag/daikon/download/jdoc/binary_variables/package-tree.html

where the two links with anchor text "PREV" are:

  <a href="..//package-tree.html"><b>PREV</b></a>

Such a link works fine in my browser (Firefox), and checklink shouldn't infinite loop.

I have attached a patch that corrects the problem.

-Michael Ernst
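One plausible way the regress arises, sketched in Python rather than checklink's Perl: if relative references are resolved as in RFC 3986 and empty path segments are kept, each pass through a "..//" link yields a path with one more slash, so a record of already-visited URLs never catches the repeat. The helper names and the second link ("binary_variables/package-tree.html", leading from the parent page back into the package) are assumptions for illustration; they are not taken from checklink or from the test case.

```python
def remove_dot_segments(path):
    """RFC 3986 section 5.2.4. Empty segments ("//") are preserved."""
    out = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]
            if out:
                out.pop()
        elif path == "/..":
            path = "/"
            if out:
                out.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/") to the output.
            idx = path.find("/", 1)
            if idx == -1:
                out.append(path)
                path = ""
            else:
                out.append(path[:idx])
                path = path[idx:]
    return "".join(out)


def resolve(base_path, ref):
    """Resolve a relative-path reference against an absolute base path."""
    merged = base_path[: base_path.rfind("/") + 1] + ref
    return remove_dot_segments(merged)


def crawl(start, rounds):
    """Follow the PREV link, then the hypothetical link back down."""
    seen = [start]
    page = start
    for _ in range(rounds):
        parent = resolve(page, "..//package-tree.html")
        page = resolve(parent, "binary_variables/package-tree.html")
        seen += [parent, page]
    return seen


if __name__ == "__main__":
    for p in crawl("/jdoc/binary_variables/package-tree.html", 3):
        print(p)
```

Under these assumptions every path in the list is distinct, because the run of slashes after "/jdoc" keeps growing. With the doubled slash collapsed before resolution, the same two links would simply alternate between two URLs.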
Index: checklink
===================================================================
RCS file: /sources/public/perl/modules/W3C/LinkChecker/bin/checklink,v
retrieving revision 4.116
diff -u -u -b -r4.116 checklink
--- checklink 22 Sep 2008 19:33:31 -0000 4.116
+++ checklink 25 Sep 2008 08:41:02 -0000
@@ -37,7 +37,7 @@
 use LWP::UserAgent qw();
 
 # if 0, ignore robots exclusion (useful for testing)
-use constant USE_ROBOT_UA => 1;
+use constant USE_ROBOT_UA => 0;
 
 if (USE_ROBOT_UA) {
   @W3C::UserAgent::ISA = qw(LWP::RobotUA);
@@ -962,6 +962,10 @@
   # Record all the links found
   while (my ($link, $lines) = each(%{$p->{Links}})) {
     my $link_uri = URI->new($link);
+    # Remove repeated slashes, to avoid duplicated checking or infinite
+    # recursion. Don't match the double slashes in "http://", however.
+    $link_uri =~ s|([^:])//+|$1/|g;
+
     my $abs_link_uri = URI->new_abs($link_uri, $base);
 
     if ($Opts{Masquerade}) {
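To see what the substitution in the patch does to a few inputs, here is the same regular expression transcribed into Python. The patch itself is Perl; `collapse_slashes` and the sample URLs on example.org are made up for this sketch.

```python
import re

# Same pattern as the patch: a character other than ":" followed by two or
# more slashes. The captured character is kept and the run becomes one "/".
_REPEATED_SLASHES = re.compile(r"([^:])//+")


def collapse_slashes(link):
    """Collapse runs of slashes that are not directly preceded by a colon."""
    return _REPEATED_SLASHES.sub(r"\1/", link)


if __name__ == "__main__":
    for sample in (
        "..//package-tree.html",
        "http://example.org//a///b",
        "http://example.org/a/b",
        "file:///etc/hosts",
    ):
        print(sample, "->", collapse_slashes(sample))
```

In this transcription the guard only protects the two slashes that directly follow a colon, so "file:///etc/hosts" comes out as "file://etc/hosts", and a "//" inside a query string would be collapsed as well. The pattern may need tightening if such links matter.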
Attachments
- application/octet-stream attachment: checklink-infinite-regress-testcase.tgz
Received on Thursday, 25 September 2008 08:59:13 UTC