- From: Michael Ernst <mernst@alum.mit.edu>
- Date: Fri, 17 Oct 2008 09:13:48 +0200
- To: www-validator@w3.org
- Message-ID: <18680.15148.257035.521000@swsmde.ds.mpi-sws.mpg.de>
Sometimes, a user expects that checklink will produce certain warnings. Reasons include robot exclusion rules, password-protected content, and errors in automatically generated content. Such a user would prefer that checklink show only the unexpected warnings, rather than hiding them in an avalanche of uninteresting output.

This patch adds flags that suppress certain warnings. These flags complement the existing --exclude and --exclude-docs flags. (The patch also permits --exclude-docs to be supplied multiple times instead of just once.)

Here is a snippet from the (new) "checklink --help" output.

 --exclude-redirect URI->URI  Do not report a redirect from the first to the
                              second URI. The "->" is literal text.
 --exclude-redirect-prefix URI->URI
                              Do not report a redirect from a child of
                              the first URI to the same child of the second
                              URI. The "->" is literal text.
 --exclude-broken CODE:URI    Do not report a broken link with the given CODE.
                              CODE is HTTP response, or -1 for robots exclusion.
                              The ":" is literal text.
 --exclude-fragment URL#FRAG  Do not report the given broken fragment.
                              The "#" is literal text.

With this patch, I am able to regularly check large sets of webpages for broken links, with no warning output in the common case.

Below, I have attached the patch, and also an example of some arguments that I pass to checklink.

                    -Michael Ernst
Index: checklink
===================================================================
RCS file: /sources/public/perl/modules/W3C/LinkChecker/bin/checklink,v
retrieving revision 4.118
diff -u -u -b -r4.118 checklink
--- checklink   17 Oct 2008 04:36:19 -0000      4.118
+++ checklink   17 Oct 2008 07:09:13 -0000
@@ -370,6 +370,10 @@
     Base_Locations    => [],
     Exclude           => undef,
     Exclude_Docs      => undef,
+    Exclude_Redirect  => undef,
+    Exclude_Redirect_Prefix => undef,
+    Exclude_Broken    => undef,
+    Exclude_Fragment  => undef,
     Masquerade        => 0,
     Masquerade_From   => '',
     Masquerade_To     => '',
@@ -401,10 +405,6 @@
   eval { $Opts{Exclude} = qr/$Opts{Exclude}/o; };
   &usage(1, "Error in exclude regexp: $@") if $@;
 }
-if (defined($Opts{Exclude_Docs})) {
-  eval { $Opts{Exclude_Docs} = qr/$Opts{Exclude_Docs}/o; };
-  &usage(1, "Error in exclude-docs regexp: $@") if $@;
-}
 if (defined($Opts{Trusted})) {
   eval { $Opts{Trusted} = qr/$Opts{Trusted}/io; };
   &usage(1, "Error in trusted domains regexp: $@") if $@;
@@ -616,7 +616,11 @@
                                           if $Opts{Depth} == 0; },
              'l|location=s'    => \@locs,
              'X|exclude=s',    => \$Opts{Exclude},
-             'exclude-docs=s', => \$Opts{Exclude_Docs},
+             'exclude-docs=s@', => \$Opts{Exclude_Docs},
+             'exclude-redirect=s@', => \$Opts{Exclude_Redirect},
+             'exclude-redirect-prefix=s@', => \$Opts{Exclude_Redirect_Prefix},
+             'exclude-broken=s@', => \$Opts{Exclude_Broken},
+             'exclude-fragment=s@', => \$Opts{Exclude_Fragment},
              'u|user=s'        => \$Opts{User},
              'p|password=s'    => \$Opts{Password},
              't|timeout=i'     => \$Opts{Timeout},
@@ -699,6 +703,16 @@
                             as --exclude-docs with the same regexp would.
  --exclude-docs REGEXP      In recursive mode, do not check links in documents
                             whose full, canonical URIs match REGEXP.
+ --exclude-redirect URI->URI  Do not report a redirect from the first to the
+                            second URI. The \"->\" is literal text.
+ --exclude-redirect-prefix URI->URI  Do not report a redirect from a child of
+                            the first URI to the same child of the second
+                            URI. The \"->\" is literal text.
+ --exclude-broken CODE:URI  Do not report a broken link with the given CODE.
+                            CODE is HTTP response, or -1 for robots exclusion.
+                            The \":\" is literal text.
+ --exclude-fragment URL#FRAG  Do not report the given broken fragment.
+                            The \"#\" is literal text.
  -L, --languages LANGS      Accept-Language header to send. The special value
                             'auto' causes autodetection from the environment.
  -R, --no-referer           Do not send the Referer HTTP header.
@@ -1202,9 +1216,14 @@
 
   my $candidate = URI->new($uri)->canonical();
 
-  return 0
-    if ((defined($Opts{Exclude}) && $candidate =~ $Opts{Exclude}) ||
-        (defined($Opts{Exclude_Docs}) && $candidate =~ $Opts{Exclude_Docs}));
+  return 0 if (defined($Opts{Exclude}) && $candidate =~ $Opts{Exclude});
+  if (defined($Opts{Exclude_Docs})) {
+    for my $excluded_doc (@{$Opts{Exclude_Docs}}) {
+      if ($candidate =~ $excluded_doc) {
+        return 0;
+      }
+    }
+  }
 
   foreach my $base (@{$Opts{Base_Locations}}) {
     my $rel = $candidate->rel($base);
@@ -1213,7 +1232,7 @@
     return 1;
   }
 
-  return 0; # We always have at least one base location.
+  return 0; # We always have at least one base location, but none matched.
 }
 
 ##################################################
@@ -1359,6 +1378,19 @@
     $results{$uri}{location}{orig_message} = $tmp->message() || '(no message)';
   }
   $results{$uri}{location}{success} = $response->is_success();
+
+  # If a suppressed broken link, fill the data structure like a typical success.
+  # print STDERR "success? " . $results{$uri}{location}{success} . ": $uri\n";
+  if (! $results{$uri}{location}{success}) {
+    my $code = $results{$uri}{location}{code};
+    my $match = grep { $_ eq "$code:$uri" } @{$Opts{Exclude_Broken}};
+    if ($match) {
+      $results{$uri}{location}{success} = 1;
+      $results{$uri}{location}{code} = 100;
+      $results{$uri}{location}{display} = 100;
+    }
+  }
+
   # Stores the authentication information
   if (defined($response->{Realm})) {
     $results{$uri}{location}{realm} = $response->{Realm};
@@ -1728,7 +1760,8 @@
     # Check that the fragments exist
     foreach my $fragment (keys %{$links->{$uri}{fragments}}) {
       if (defined($p->{Anchors}{$fragment})
-          || &escape_match($fragment, $p->{Anchors})) {
+          || &escape_match($fragment, $p->{Anchors})
+          || grep { $_ eq "$uri#$fragment" } @{$Opts{Exclude_Fragment}}) {
         $results{$uri}{fragments}{$fragment} = 1;
       } else {
         $results{$uri}{fragments}{$fragment} = 0;
@@ -1822,6 +1855,44 @@
 {
   my ($redirects, $response) = @_;
   for (my $prev = $response->previous(); $prev; $prev = $prev->previous()) {
+
+    # Check for redirect match.
+    my $from = $prev->request()->url();
+    my $to = $response->request()->url(); # same on every loop iteration
+    my $from_to = $from . '->' . $to;
+    my $match = grep { $_ eq $from_to } @{$Opts{Exclude_Redirect}};
+    # print STDERR "Result $match of checking $from_to\n";
+    if ($match) { next; }
+
+    # Check for redirect_prefix match
+    my $prefix_match = 0;
+    my $from_len = length($from);
+    my $to_len = length($to);
+    for my $redir_prefix (@{$Opts{Exclude_Redirect_Prefix}}) {
+      if ($redir_prefix !~ /^(.*)->(.*)$/) {
+        die "Bad exclude-redirect-prefix: $redir_prefix";
+      }
+      my $from_prefix = $1;
+      my $to_prefix = $2;
+      my $from_prefix_len = length($from_prefix);
+      my $to_prefix_len = length($to_prefix);
+      if (($from eq $from_prefix) && ($to eq $to_prefix)) {
+        $prefix_match = 1;
+        last;
+      } elsif (($from_prefix_len < $from_len)
+               && ($to_prefix_len < $to_len)
+               && ($from_prefix eq substr($from, 0, $from_prefix_len))
+               && ($to_prefix eq substr($to, 0, $to_prefix_len))
+               && (substr($from, $from_prefix_len) eq substr($to, $to_prefix_len))) {
+        $prefix_match = 1;
+        last;
+      }
+    }
+    if ($prefix_match) {
+      # print STDERR "AN EXCLUDED REDIRECT:\n $from\n $to\n";
+      next;
+    }
+
     $redirects->{$prev->request()->url()} = $response->request()->url();
   }
   return;
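
To make the --exclude-redirect-prefix rule in the last hunk easier to follow, here is a minimal standalone sketch of the same comparison. The helper name redirect_prefix_excluded is hypothetical and is not part of checklink; it takes plain strings rather than the URI objects the real code handles, so treat it as an illustration of the rule, not as the patch itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of checklink): returns 1 if the redirect
# $from -> $to is covered by one of the "FROM_PREFIX->TO_PREFIX" rules.
sub redirect_prefix_excluded {
    my ($from, $to, @rules) = @_;
    for my $rule (@rules) {
        # Greedy match, as in the patch: the rule is split at its last "->".
        my ($from_prefix, $to_prefix) = $rule =~ /^(.*)->(.*)$/
            or die "Bad exclude-redirect-prefix: $rule";
        # Case 1: both URIs equal their prefixes exactly.
        return 1 if $from eq $from_prefix && $to eq $to_prefix;
        # Case 2: both URIs extend their prefixes by the same non-empty suffix.
        next unless length($from) > length($from_prefix)
                 && length($to)   > length($to_prefix);
        next unless substr($from, 0, length $from_prefix) eq $from_prefix
                 && substr($to,   0, length $to_prefix)   eq $to_prefix;
        return 1 if substr($from, length $from_prefix)
                 eq substr($to,   length $to_prefix);
    }
    return 0;
}

my @rules = ('http://pag.csail.mit.edu/->http://groups.csail.mit.edu/pag/');

# Same child ("daikon/") on both sides: suppressed.
print redirect_prefix_excluded(
    'http://pag.csail.mit.edu/daikon/',
    'http://groups.csail.mit.edu/pag/daikon/', @rules), "\n";   # prints 1

# Different child on the target side: still reported.
print redirect_prefix_excluded(
    'http://pag.csail.mit.edu/daikon/',
    'http://groups.csail.mit.edu/pag/other/', @rules), "\n";    # prints 0
```

Because the regexp is greedy, a rule whose target URI itself contains "->" would be split at the last occurrence; that is a property of the pattern used in the patch, noted here only as an observation.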
--exclude-broken -1:http://whereis.mit.edu/map-jpg?selection=32&Buildings=go
--exclude-broken 302:MAY-NEED-TO-ALSO-LIST-IN-exclude-redirect-CLAUSE
--exclude-broken 302:http://ieeexplore.ieee.org/
--exclude-broken 302:http://www.hotelatmit.com/
--exclude-broken 403:http://validator.w3.org/check?uri=referer
--exclude-broken 403:http://www.acm.org/
--exclude-broken 403:http://www.acm.org/sigs/volunteer_resources/conference_manual/
--exclude-broken 403:http://www.cs.washington.edu/orgs/student-affairs/gsc/jobs/
--exclude-broken 403:http://www.elsevier.nl/locate/disc/
--exclude-broken 403:https://www.csail.mit.edu/mrbs/
--exclude-broken 404:file://afs/csail/group/pag/software/pkg/freshmeat-submit-1.6/freshmeat-submit.html
--exclude-broken 404:http://groups.google.com/group/jsr-305/
--exclude-broken 404:http://groups.google.com/group/jsr-305/web/proposed-annotations
--exclude-broken 404:http://java.sun.com/javase/6/docs/jdk/api/javac/tree/com/sun/source/tree/AnnotatedTypeTree.html?is-external=true
--exclude-broken 404:http://www.eclipse.org/legal/Eclipse%20EPL%202003_11_10%20Final_files/filelist.xml
--exclude-broken 405:http://www.amazon.com/exec/obidos/tg/detail/-/0321205758/103-2932545-5299831?v=glance
--exclude-broken 412:http://validator.w3.org/check?uri=referer
--exclude-broken 500:http://conferences.iee.org/icse2004/
--exclude-broken 500:http://jcp.org/en/jsr/detail?id=308
--exclude-broken 500:http://www.orbitz.com/
--exclude-broken 500:https://ca.mit.edu:444/moira/showresult.jhtml?list=parg&operation=displaylistinfo
--exclude-broken 500:https://eecsfacweb.mit.edu/
--exclude-broken 500:https://web.mit.edu/21.guide/www/l-rec-ob.htm
--exclude-broken 500:https://web.mit.edu/21.guide/www/l-rec-wr.htm
--exclude-broken 500:https://web.mit.edu/21.guide/www/toc.htm
--exclude-broken 500:https://web.mit.edu/6.033/www/staff/
--exclude-broken 500:https://web.mit.edu/6.170/staff/
--exclude-broken 500:https://web.mit.edu/6.170/staff/staging/www/
--exclude-broken 500:https://www.cvshome.org/docs/manual/cvs-1.11.18/cvs_5.html
--exclude-broken 501:http://www.bizrate.com/
--exclude-broken 501:https://web.mit.edu/21.guide/www/l-rec-wr.htm
--exclude-broken 501:https://web.mit.edu/21.guide/www/toc.htm
--exclude-broken 503:http://www.marriott.com/hotels/travel/boscb-boston-marriott-cambridge/
--exclude-broken 503:http://www.marriott.com/hotels/travel/boscm-residence-inn-boston-cambridge/
--exclude-docs /~adonovan/
--exclude-docs bugzilla/
--exclude-docs daikon/download/jdoc
--exclude-docs http://pag.csail.mit.edu/jsr308/dist/doc/javac_lifecycle
--exclude-docs mernst/(public_html/)?(ir95|advice/conference/)
--exclude-fragment http://groups.csail.mit.edu/pag/jsr308/current/doc/checkers/igj/quals/I.html#annotation_type_element_detail
--exclude-fragment http://groups.csail.mit.edu/pag/jsr308/current/doc/checkers/quals/DefaultQualifiers.html#annotation_type_element_detail
--exclude-fragment http://groups.csail.mit.edu/pag/jsr308/current/doc/checkers/quals/ImplicitFor.html#annotation_type_element_detail
--exclude-fragment http://java.sun.com/javase/6/docs/api/javax/lang/model/SourceVersion.html?is-external=true#RELEASE_7
--exclude-fragment http://java.sun.com/javase/6/docs/jdk/api/javac/tree/com/sun/source/util/SimpleTreeVisitor.html?is-external=true#visitAnnotatedType(com.sun.source.tree.AnnotatedTypeTree,%20P)
--exclude-fragment http://java.sun.com/javase/6/docs/jdk/api/javac/tree/com/sun/source/util/SimpleTreeVisitor.html?is-external=true#visitAnnotatedType(com.sun.source.tree.AnnotatedTypeTree,%20P)
--exclude-fragment http://java.sun.com/javase/6/docs/jdk/api/javac/tree/com/sun/source/util/TreeScanner.html?is-external=true#visitAnnotatedType(com.sun.source.tree.AnnotatedTypeTree,%20P)
--exclude-redirect http://2006.ecoop.org/->http://www.emn.fr/x-info/ecoop2006/
--exclude-redirect http://en.wikipedia.org/->http://en.wikipedia.org/wiki/Main_Page
--exclude-redirect http://groups.google.com/->http://groups-beta.google.com/
--exclude-redirect http://ieeexplore.ieee.org/->http://ieeexplore.ieee.org/Xplore/home.jsp
--exclude-redirect http://libraries.mit.edu/get/ieee->http://aeryn.mit.edu/emetrics/count.php?http://libproxy.mit.edu/login?url=http://ieeexplore.ieee.org/
--exclude-redirect http://libraries.mit.edu/get/lncs->http://aeryn.mit.edu/emetrics/count.php?http://libproxy.mit.edu/login?url=http://www.springerlink.com/openurl.asp?genre=journal&issn=0302-9743
--exclude-redirect http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vccore/html/_core_building_on_the_command_line.3a_.overview.asp->http://msdn.microsoft.com/library/shared/deeptree/bot/bot.asp?dtcnfg=/library/deeptreeconfig.xml
--exclude-redirect http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vccore/html/_core_building_on_the_command_line.3a_.overview.asp->http://msdn.microsoft.com/library/shared/deeptree/bot/bot.asp?dtcnfg=/library/deeptreeconfig.xml
--exclude-redirect http://pag.csail.mit.edu/~smcc/->http://people.csail.mit.edu/people/smcc/
--exclude-redirect http://pag/daikon/mit/log2html.php->http://www.pag.csail.mit.edu/daikon/mit/log2html.php
--exclude-redirect http://student.mit.edu/catalog/index.cgi->http://student.mit.edu/@8178100.17571/catalog/index.cgi
--exclude-redirect http://texi2html.cvshome.org/->http://ximbiot.com/cvs/
--exclude-redirect http://www.a1trails.com/xc_ski/xc_ma.html->http://www.a1trails.com/home/nospiders.html
--exclude-redirect http://www.a1trails.com/xc_ski/xc_nh.html->http://www.a1trails.com/home/nospiders.html
--exclude-redirect http://www.amtrak.com/->http://www.amtrak.com/servlet/ContentServer?pagename=Amtrak/HomePage
--exclude-redirect http://www.aro.ncren.net/->http://www.arl.army.mil/main/main/default.cfm?Action=29&Page=29
--exclude-redirect http://www.computer.org/tse/->http://www.computer.org/portal/site/transactions/index.jsp?pageID=tse_home/
--exclude-redirect http://www.cs.cmu.edu/~dpelleg/kmeans.html->http://www-2.cs.cmu.edu/~dpelleg/kmeans.html
--exclude-redirect http://www.cs.utexas.edu/->http://www.cs.utexas.edu/home/department/welcome.html
--exclude-redirect http://www.csail.mit.edu/->http://www.csail.mit.edu/index.php
--exclude-redirect http://www.dexonline.com/->http://www.dexonline.com/displayhome.ds
--exclude-redirect http://www.fair.org/->http://www.fair.org/index.php
--exclude-redirect http://www.hotelatmit.com/->http://www.hotelatmit.com/
--exclude-redirect http://www.ibm.com/developerworks/oss/jikes/->http://www-124.ibm.com/developerworks/oss/jikes/
--exclude-redirect http://www.ibm.com/developerworks/oss/jikes/->http://www-124.ibm.com/developerworks/oss/jikes/
--exclude-redirect http://www.jmlspecs.org/->http://www.cs.iastate.edu/~leavens/JML/
--exclude-redirect http://www.jmlspecs.org/->http://www.eecs.ucf.edu/~leavens/JML/
--exclude-redirect http://www.nsf.gov/home/cise/->http://www.nsf.gov/dir/index.jsp?org=CISE
--exclude-redirect http://www.pricegrabber.com/->http://www.pricegrabber.com/spiders.php
--exclude-redirect http://www.rational.com/licensing->http://www-306.ibm.com/software/rational/support/licensing/
--exclude-redirect http://www.rational.com/licensing/->http://www-306.ibm.com/software/rational/support/licensing/
--exclude-redirect http://www.usps.gov/ncsc/lookups/lookup_zip%2B4.html->http://www.usps.com/ncsc/lookups/lookup_zip%2b4.html
--exclude-redirect https://tree-api.dev.java.net/->https://www.dev.java.net/servlets/Login?cookieCheck=failed
--exclude-redirect-prefix http://pag.csail.mit.edu/->http://groups.csail.mit.edu/pag/
--exclude-redirect-prefix http://pag.csail.mit.edu/->http://www.pag.csail.mit.edu/
--exclude-redirect-prefix http://pag.csail.mit.edu/~->http://people.csail.mit.edu/
--exclude-redirect-prefix http://texi2html.cvshome.org/->https://texi2html.cvshome.org/
--exclude-redirect-prefix http://www.csail.mit.edu/~->http://people.csail.mit.edu/
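
As a rough illustration of how a repeated-option list like the one above is collected, the following sketch parses three of these arguments with the same Getopt::Long "=s@" specifications the patch uses. The %opts hash and the @args array are stand-ins invented for this example; only the option specifications come from the patch.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Stand-in for checklink's %Opts; both entries start out undefined,
# as they do in the patch.
my %opts = (Exclude_Docs => undef, Exclude_Broken => undef);

# A few arguments taken from the example list.
my @args = (
    '--exclude-docs',   'bugzilla/',
    '--exclude-docs',   'daikon/download/jdoc',
    '--exclude-broken', '403:http://www.acm.org/',
);

# With "=s@" and a scalar destination, Getopt::Long stores a reference to an
# array holding every value given for that option, in command-line order.
GetOptionsFromArray(\@args,
    'exclude-docs=s@'   => \$opts{Exclude_Docs},
    'exclude-broken=s@' => \$opts{Exclude_Broken},
) or die "Bad options\n";

print scalar(@{ $opts{Exclude_Docs} }), " exclude-docs patterns\n";  # prints "2 exclude-docs patterns"
print "$_\n" for @{ $opts{Exclude_Broken} };                         # prints "403:http://www.acm.org/"
```

An option that is never given leaves its entry undefined, which is why the patched code tests defined($Opts{Exclude_Docs}) before looping over it.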
Received on Friday, 17 October 2008 07:14:31 UTC