checklink: suppress expected errors to avoid false positive warnings

Sometimes a user expects checklink to produce certain warnings.  Reasons
include robot exclusion rules, password-protected content, and errors in
automatically generated content.

A user would prefer that checklink show only the unexpected warnings, rather
than burying them in an avalanche of uninteresting output.

This patch adds flags that suppress certain warnings.  These flags
complement the existing --exclude and --exclude-docs flags.  (The patch
also permits --exclude-docs to be supplied multiple times instead of just
once.)

Here is a snippet from the (new) "checklink --help" output.

 --exclude-redirect URI->URI  Do not report a redirect from the first to the
                            second URI.  The "->" is literal text.
 --exclude-redirect-prefix URI->URI  Do not report a redirect from a child of
                            the first URI to the same child of the second
                            URI.  The "->" is literal text.
 --exclude-broken CODE:URI  Do not report a broken link with the given CODE.
                            CODE is HTTP response, or -1 for robots exclusion.
                            The ":" is literal text.
 --exclude-fragment URL#FRAG  Do not report the given broken fragment.
                            The "#" is literal text.
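
To illustrate the --exclude-redirect-prefix rule described above, here is a
standalone sketch.  It is not the code from the patch: the function name
redirect_prefix_excluded and the "daikon/" child paths are made up for the
example, and it only assumes the "same child under both prefixes" behavior
stated in the help text.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of checklink): returns 1 when the redirect
# $from -> $to is covered by a "FROM_PREFIX->TO_PREFIX" specification.
sub redirect_prefix_excluded {
    my ($spec, $from, $to) = @_;

    # Split the specification at the literal "->".
    my ($from_prefix, $to_prefix) = $spec =~ /^(.*)->(.*)$/
        or die "Bad exclude-redirect-prefix: $spec";

    # Both URIs must start with their respective prefixes ...
    return 0 unless index($from, $from_prefix) == 0;
    return 0 unless index($to,   $to_prefix)   == 0;

    # ... and what remains (the "child" part) must be identical.
    my $from_child = substr($from, length $from_prefix);
    my $to_child   = substr($to,   length $to_prefix);
    return $from_child eq $to_child ? 1 : 0;
}

my $spec = 'http://pag.csail.mit.edu/->http://groups.csail.mit.edu/pag/';

# Same child ("daikon/") on both sides: the redirect is suppressed.
print redirect_prefix_excluded($spec,
    'http://pag.csail.mit.edu/daikon/',
    'http://groups.csail.mit.edu/pag/daikon/'), "\n";

# Different children: the redirect is still reported.
print redirect_prefix_excluded($spec,
    'http://pag.csail.mit.edu/daikon/',
    'http://groups.csail.mit.edu/pag/other/'), "\n";
```

A redirect between the two prefixes themselves (an empty child on both sides)
also counts as a match in this sketch, which corresponds to the exact-match
case of --exclude-redirect.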

With this patch, I am able to regularly check large sets of webpages for
broken links, with no warning output in the common case.  Below, I have
attached the patch, followed by an example of some arguments that I pass
to checklink.

                    -Michael Ernst
Index: checklink
===================================================================
RCS file: /sources/public/perl/modules/W3C/LinkChecker/bin/checklink,v
retrieving revision 4.118
diff -u -u -b -r4.118 checklink
--- checklink	17 Oct 2008 04:36:19 -0000	4.118
+++ checklink	17 Oct 2008 07:09:13 -0000
@@ -370,6 +370,10 @@
     Base_Locations    => [],
     Exclude           => undef,
     Exclude_Docs      => undef,
+    Exclude_Redirect  => undef,
+    Exclude_Redirect_Prefix => undef,
+    Exclude_Broken    => undef,
+    Exclude_Fragment  => undef,
     Masquerade        => 0,
     Masquerade_From   => '',
     Masquerade_To     => '',
@@ -401,10 +405,6 @@
   eval { $Opts{Exclude} = qr/$Opts{Exclude}/o; };
   &usage(1, "Error in exclude regexp: $@") if $@;
 }
-if (defined($Opts{Exclude_Docs})) {
-  eval { $Opts{Exclude_Docs} = qr/$Opts{Exclude_Docs}/o; };
-  &usage(1, "Error in exclude-docs regexp: $@") if $@;
-}
 if (defined($Opts{Trusted})) {
   eval { $Opts{Trusted} = qr/$Opts{Trusted}/io; };
   &usage(1, "Error in trusted domains regexp: $@") if $@;
@@ -616,7 +616,11 @@
                                           if $Opts{Depth} == 0; },
              'l|location=s'    => \@locs,
              'X|exclude=s',    => \$Opts{Exclude},
-             'exclude-docs=s', => \$Opts{Exclude_Docs},
+             'exclude-docs=s@', => \$Opts{Exclude_Docs},
+             'exclude-redirect=s@', => \$Opts{Exclude_Redirect},
+             'exclude-redirect-prefix=s@', => \$Opts{Exclude_Redirect_Prefix},
+             'exclude-broken=s@', => \$Opts{Exclude_Broken},
+             'exclude-fragment=s@', => \$Opts{Exclude_Fragment},
              'u|user=s'        => \$Opts{User},
              'p|password=s'    => \$Opts{Password},
              't|timeout=i'     => \$Opts{Timeout},
@@ -699,6 +703,16 @@
                             as --exclude-docs with the same regexp would.
  --exclude-docs REGEXP      In recursive mode, do not check links in documents
                             whose full, canonical URIs match REGEXP.
+ --exclude-redirect URI->URI  Do not report a redirect from the first to the
+                            second URI.  The \"->\" is literal text.
+ --exclude-redirect-prefix URI->URI  Do not report a redirect from a child of
+                            the first URI to the same child of the second
+                            URI.  The \"->\" is literal text.
+ --exclude-broken CODE:URI  Do not report a broken link with the given CODE.
+                            CODE is HTTP response, or -1 for robots exclusion.
+                            The \":\" is literal text.
+ --exclude-fragment URL#FRAG  Do not report the given broken fragment.
+                            The \"#\" is literal text.
  -L, --languages LANGS      Accept-Language header to send.  The special value
                             'auto' causes autodetection from the environment.
  -R, --no-referer           Do not send the Referer HTTP header.
@@ -1202,9 +1216,14 @@
 
   my $candidate = URI->new($uri)->canonical();
 
-  return 0
-      if ((defined($Opts{Exclude}) && $candidate =~ $Opts{Exclude}) ||
-          (defined($Opts{Exclude_Docs}) && $candidate =~ $Opts{Exclude_Docs}));
+  return 0 if (defined($Opts{Exclude}) && $candidate =~ $Opts{Exclude});
+  if (defined($Opts{Exclude_Docs})) {
+    for my $excluded_doc (@{$Opts{Exclude_Docs}}) {
+      if ($candidate =~ $excluded_doc) {
+        return 0;
+      }
+    }
+  }
 
   foreach my $base (@{$Opts{Base_Locations}}) {
     my $rel = $candidate->rel($base);
@@ -1213,7 +1232,7 @@
     return 1;
   }
 
-  return 0; # We always have at least one base location.
+  return 0; # We always have at least one base location, but none matched.
 }
 
 ##################################################
@@ -1359,6 +1378,19 @@
     $results{$uri}{location}{orig_message} = $tmp->message() || '(no message)';
   }
   $results{$uri}{location}{success} = $response->is_success();
+
+  # If a suppressed broken link, fill the data structure like a typical success.
+  # print STDERR "success? " . $results{$uri}{location}{success} . ": $uri\n";
+  if (! $results{$uri}{location}{success}) {
+    my $code = $results{$uri}{location}{code};
+    my $match = grep { $_ eq "$code:$uri" } @{$Opts{Exclude_Broken}};
+    if ($match) {
+      $results{$uri}{location}{success} = 1;
+      $results{$uri}{location}{code} = 100;
+      $results{$uri}{location}{display} = 100;
+    }
+  }
+
   # Stores the authentication information
   if (defined($response->{Realm})) {
     $results{$uri}{location}{realm} = $response->{Realm};
@@ -1728,7 +1760,8 @@
   # Check that the fragments exist
   foreach my $fragment (keys %{$links->{$uri}{fragments}}) {
     if (defined($p->{Anchors}{$fragment})
-        || &escape_match($fragment, $p->{Anchors})) {
+        || &escape_match($fragment, $p->{Anchors})
+        || grep { $_ eq "$uri#$fragment" } @{$Opts{Exclude_Fragment}}) {
       $results{$uri}{fragments}{$fragment} = 1;
     } else {
       $results{$uri}{fragments}{$fragment} = 0;
@@ -1822,6 +1855,44 @@
 {
   my ($redirects, $response) = @_;
   for (my $prev = $response->previous(); $prev; $prev = $prev->previous()) {
+
+    # Check for redirect match.
+    my $from = $prev->request()->url();
+    my $to = $response->request()->url(); # same on every loop iteration
+    my $from_to = $from . '->' . $to;
+    my $match = grep { $_ eq $from_to } @{$Opts{Exclude_Redirect}};
+    # print STDERR "Result $match of checking $from_to\n";
+    if ($match) { next; }
+
+    # Check for redirect_prefix match
+    my $prefix_match = 0;
+    my $from_len = length($from);
+    my $to_len = length($to);
+    for my $redir_prefix (@{$Opts{Exclude_Redirect_Prefix}}) {
+      if ($redir_prefix !~ /^(.*)->(.*)$/) {
+        die "Bad exclude-redirect-prefix: $redir_prefix";
+      }
+      my $from_prefix = $1;
+      my $to_prefix = $2;
+      my $from_prefix_len = length($from_prefix);
+      my $to_prefix_len = length($to_prefix);
+      if (($from eq $from_prefix) && ($to eq $to_prefix)) {
+        $prefix_match = 1;
+        last;
+      } elsif (($from_prefix_len < $from_len)
+                 && ($to_prefix_len < $to_len)
+                 && ($from_prefix eq substr($from, 0, $from_prefix_len))
+                 && ($to_prefix eq substr($to, 0, $to_prefix_len))
+                 && (substr($from, $from_prefix_len) eq substr($to, $to_prefix_len))) {
+        $prefix_match = 1;
+        last;
+      }
+    }
+    if ($prefix_match) {
+      # print STDERR "AN EXCLUDED REDIRECT:\n  $from\n  $to\n";
+      next;
+    }
+
     $redirects->{$prev->request()->url()} = $response->request()->url();
   }
   return;

Example arguments that I pass to checklink:

--exclude-broken -1:http://whereis.mit.edu/map-jpg?selection=32&Buildings=go 
--exclude-broken 302:MAY-NEED-TO-ALSO-LIST-IN-exclude-redirect-CLAUSE
--exclude-broken 302:http://ieeexplore.ieee.org/
--exclude-broken 302:http://www.hotelatmit.com/
--exclude-broken 403:http://validator.w3.org/check?uri=referer
--exclude-broken 403:http://www.acm.org/
--exclude-broken 403:http://www.acm.org/sigs/volunteer_resources/conference_manual/
--exclude-broken 403:http://www.cs.washington.edu/orgs/student-affairs/gsc/jobs/
--exclude-broken 403:http://www.elsevier.nl/locate/disc/
--exclude-broken 403:https://www.csail.mit.edu/mrbs/
--exclude-broken 404:file://afs/csail/group/pag/software/pkg/freshmeat-submit-1.6/freshmeat-submit.html
--exclude-broken 404:http://groups.google.com/group/jsr-305/
--exclude-broken 404:http://groups.google.com/group/jsr-305/web/proposed-annotations
--exclude-broken 404:http://java.sun.com/javase/6/docs/jdk/api/javac/tree/com/sun/source/tree/AnnotatedTypeTree.html?is-external=true
--exclude-broken 404:http://www.eclipse.org/legal/Eclipse%20EPL%202003_11_10%20Final_files/filelist.xml
--exclude-broken 405:http://www.amazon.com/exec/obidos/tg/detail/-/0321205758/103-2932545-5299831?v=glance
--exclude-broken 412:http://validator.w3.org/check?uri=referer
--exclude-broken 500:http://conferences.iee.org/icse2004/
--exclude-broken 500:http://jcp.org/en/jsr/detail?id=308
--exclude-broken 500:http://www.orbitz.com/
--exclude-broken 500:https://ca.mit.edu:444/moira/showresult.jhtml?list=parg&operation=displaylistinfo
--exclude-broken 500:https://eecsfacweb.mit.edu/
--exclude-broken 500:https://web.mit.edu/21.guide/www/l-rec-ob.htm
--exclude-broken 500:https://web.mit.edu/21.guide/www/l-rec-wr.htm
--exclude-broken 500:https://web.mit.edu/21.guide/www/toc.htm
--exclude-broken 500:https://web.mit.edu/6.033/www/staff/
--exclude-broken 500:https://web.mit.edu/6.170/staff/
--exclude-broken 500:https://web.mit.edu/6.170/staff/staging/www/
--exclude-broken 500:https://www.cvshome.org/docs/manual/cvs-1.11.18/cvs_5.html
--exclude-broken 501:http://www.bizrate.com/
--exclude-broken 501:https://web.mit.edu/21.guide/www/l-rec-wr.htm
--exclude-broken 501:https://web.mit.edu/21.guide/www/toc.htm
--exclude-broken 503:http://www.marriott.com/hotels/travel/boscb-boston-marriott-cambridge/
--exclude-broken 503:http://www.marriott.com/hotels/travel/boscm-residence-inn-boston-cambridge/
--exclude-docs /~adonovan/
--exclude-docs bugzilla/
--exclude-docs daikon/download/jdoc
--exclude-docs http://pag.csail.mit.edu/jsr308/dist/doc/javac_lifecycle
--exclude-docs mernst/(public_html/)?(ir95|advice/conference/)
--exclude-fragment http://groups.csail.mit.edu/pag/jsr308/current/doc/checkers/igj/quals/I.html#annotation_type_element_detail
--exclude-fragment http://groups.csail.mit.edu/pag/jsr308/current/doc/checkers/quals/DefaultQualifiers.html#annotation_type_element_detail
--exclude-fragment http://groups.csail.mit.edu/pag/jsr308/current/doc/checkers/quals/ImplicitFor.html#annotation_type_element_detail
--exclude-fragment http://java.sun.com/javase/6/docs/api/javax/lang/model/SourceVersion.html?is-external=true#RELEASE_7
--exclude-fragment http://java.sun.com/javase/6/docs/jdk/api/javac/tree/com/sun/source/util/SimpleTreeVisitor.html?is-external=true#visitAnnotatedType(com.sun.source.tree.AnnotatedTypeTree,%20P)
--exclude-fragment http://java.sun.com/javase/6/docs/jdk/api/javac/tree/com/sun/source/util/SimpleTreeVisitor.html?is-external=true#visitAnnotatedType(com.sun.source.tree.AnnotatedTypeTree,%20P)
--exclude-fragment http://java.sun.com/javase/6/docs/jdk/api/javac/tree/com/sun/source/util/TreeScanner.html?is-external=true#visitAnnotatedType(com.sun.source.tree.AnnotatedTypeTree,%20P)
--exclude-redirect http://2006.ecoop.org/->http://www.emn.fr/x-info/ecoop2006/
--exclude-redirect http://en.wikipedia.org/->http://en.wikipedia.org/wiki/Main_Page	
--exclude-redirect http://groups.google.com/->http://groups-beta.google.com/
--exclude-redirect http://ieeexplore.ieee.org/->http://ieeexplore.ieee.org/Xplore/home.jsp
--exclude-redirect http://libraries.mit.edu/get/ieee->http://aeryn.mit.edu/emetrics/count.php?http://libproxy.mit.edu/login?url=http://ieeexplore.ieee.org/	
--exclude-redirect http://libraries.mit.edu/get/lncs->http://aeryn.mit.edu/emetrics/count.php?http://libproxy.mit.edu/login?url=http://www.springerlink.com/openurl.asp?genre=journal&issn=0302-9743	
--exclude-redirect http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vccore/html/_core_building_on_the_command_line.3a_.overview.asp->http://msdn.microsoft.com/library/shared/deeptree/bot/bot.asp?dtcnfg=/library/deeptreeconfig.xml
--exclude-redirect http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vccore/html/_core_building_on_the_command_line.3a_.overview.asp->http://msdn.microsoft.com/library/shared/deeptree/bot/bot.asp?dtcnfg=/library/deeptreeconfig.xml
--exclude-redirect http://pag.csail.mit.edu/~smcc/->http://people.csail.mit.edu/people/smcc/
--exclude-redirect http://pag/daikon/mit/log2html.php->http://www.pag.csail.mit.edu/daikon/mit/log2html.php
--exclude-redirect http://student.mit.edu/catalog/index.cgi->http://student.mit.edu/@8178100.17571/catalog/index.cgi	
--exclude-redirect http://texi2html.cvshome.org/->http://ximbiot.com/cvs/
--exclude-redirect http://www.a1trails.com/xc_ski/xc_ma.html->http://www.a1trails.com/home/nospiders.html
--exclude-redirect http://www.a1trails.com/xc_ski/xc_nh.html->http://www.a1trails.com/home/nospiders.html
--exclude-redirect http://www.amtrak.com/->http://www.amtrak.com/servlet/ContentServer?pagename=Amtrak/HomePage
--exclude-redirect http://www.aro.ncren.net/->http://www.arl.army.mil/main/main/default.cfm?Action=29&Page=29	
--exclude-redirect http://www.computer.org/tse/->http://www.computer.org/portal/site/transactions/index.jsp?pageID=tse_home/
--exclude-redirect http://www.cs.cmu.edu/~dpelleg/kmeans.html->http://www-2.cs.cmu.edu/~dpelleg/kmeans.html
--exclude-redirect http://www.cs.utexas.edu/->http://www.cs.utexas.edu/home/department/welcome.html
--exclude-redirect http://www.csail.mit.edu/->http://www.csail.mit.edu/index.php
--exclude-redirect http://www.dexonline.com/->http://www.dexonline.com/displayhome.ds
--exclude-redirect http://www.fair.org/->http://www.fair.org/index.php	
--exclude-redirect http://www.hotelatmit.com/->http://www.hotelatmit.com/
--exclude-redirect http://www.ibm.com/developerworks/oss/jikes/->http://www-124.ibm.com/developerworks/oss/jikes/
--exclude-redirect http://www.ibm.com/developerworks/oss/jikes/->http://www-124.ibm.com/developerworks/oss/jikes/
--exclude-redirect http://www.jmlspecs.org/->http://www.cs.iastate.edu/~leavens/JML/
--exclude-redirect http://www.jmlspecs.org/->http://www.eecs.ucf.edu/~leavens/JML/
--exclude-redirect http://www.nsf.gov/home/cise/->http://www.nsf.gov/dir/index.jsp?org=CISE
--exclude-redirect http://www.pricegrabber.com/->http://www.pricegrabber.com/spiders.php
--exclude-redirect http://www.rational.com/licensing->http://www-306.ibm.com/software/rational/support/licensing/
--exclude-redirect http://www.rational.com/licensing/->http://www-306.ibm.com/software/rational/support/licensing/
--exclude-redirect http://www.usps.gov/ncsc/lookups/lookup_zip%2B4.html->http://www.usps.com/ncsc/lookups/lookup_zip%2b4.html	
--exclude-redirect https://tree-api.dev.java.net/->https://www.dev.java.net/servlets/Login?cookieCheck=failed
--exclude-redirect-prefix http://pag.csail.mit.edu/->http://groups.csail.mit.edu/pag/
--exclude-redirect-prefix http://pag.csail.mit.edu/->http://www.pag.csail.mit.edu/
--exclude-redirect-prefix http://pag.csail.mit.edu/~->http://people.csail.mit.edu/
--exclude-redirect-prefix http://texi2html.cvshome.org/->https://texi2html.cvshome.org/
--exclude-redirect-prefix http://www.csail.mit.edu/~->http://people.csail.mit.edu/

Received on Friday, 17 October 2008 07:14:31 UTC