
perl/modules/W3C/LinkChecker/docs checklink.html,1.20,1.21

From: Ville Skyttä <ville@dev.w3.org>
Date: Wed, 09 Jun 2004 06:30:07 +0000
To: www-validator-cvs@w3.org
Message-Id: <20040609063007.72CF94A847@hutz.w3.org>

Update of /sources/public/perl/modules/W3C/LinkChecker/docs
In directory hutz:/tmp/cvs-serv21590/docs

Modified Files:
	checklink.html 
Log Message:
Add blurb about robots exclusion implementation details.

Index: checklink.html
===================================================================
RCS file: /sources/public/perl/modules/W3C/LinkChecker/docs/checklink.html,v
retrieving revision 1.20
retrieving revision 1.21
diff -u -d -r1.20 -r1.21
--- checklink.html	8 Jun 2004 17:15:02 -0000	1.20
+++ checklink.html	9 Jun 2004 06:30:04 -0000	1.21
@@ -214,6 +214,19 @@
 </pre>
 
     <p>
+      Robots exclusion support in the link checker is based on the
+      <a href="http://search.cpan.org/dist/libwww-perl/lib/LWP/RobotUA.pm">LWP::RobotUA</a>
+      Perl module.  It currently supports the
+      "<a href="http://www.robotstxt.org/wc/norobots.html">original 1994 version</a>"
+      of the standard.  The robots META tag, i.e.
+      <code>&lt;meta name="robots" content="..."&gt;</code>, is not supported.
+      Beyond that, the link checker's implementation fully honors robots
+      exclusion rules: if a site's <code>/robots.txt</code> disallows it,
+      not even the first document submitted as the root of a link checker
+      run is fetched.
+    </p>
+
+    <p>
       Note that <code>/robots.txt</code> rules affect only user agents
       that honor them; they are not a generic method for access control.
     </p>
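
For reference, here is a minimal sketch of the behavior described above,
based on LWP::RobotUA's documented interface.  It is not the link checker's
actual code, and the agent name, contact address, and URL are placeholders:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::RobotUA;

    # Hypothetical agent name and contact address; LWP::RobotUA
    # requires both to identify the robot to sites it visits.
    my $ua = LWP::RobotUA->new('example-checker/0.1', 'webmaster@example.org');
    $ua->delay(1);    # minimum delay between requests to a host, in minutes

    # If the site's /robots.txt disallows this agent, LWP::RobotUA does
    # not make the request at all; it returns a 403 response instead,
    # which is why not even the root document of a run gets fetched.
    my $response = $ua->get('http://www.example.org/');
    if ($response->is_success) {
        print "Fetched ", $response->request->uri, "\n";
    }
    else {
        print "Not fetched: ", $response->status_line, "\n";
    }

Since LWP::RobotUA consults only /robots.txt, a client built this way would
not see the robots META tag either, matching the limitation noted above.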
