Re: Checklink

From: Ville Skyttä <ville.skytta@iki.fi>
Date: Mon, 27 Sep 2004 20:58:55 +0300
To: "Jason R. Leveille" <Jason_R._Leveille@fc.mcps.k12.md.us>
Cc: www-validator@w3.org, buzzoff101@yahoo.com, webteam@qohs.org
Message-Id: <1096307934.25052.45.camel@bobcat.mine.nu>

On Mon, 2004-09-27 at 02:49, Jason R. Leveille wrote:

> First of all, wonderful products.

Thanks!

> When I check the links on my pages, almost every link check results in:
> 
> Checking link http://www.qohs.org/qowebsite/valid/
> HEAD http://www.qohs.org/qowebsite/valid/
> GET http://www.qohs.org/error/index.php fetched in 2.3s 
> 
> What concerns me is that almost each link check results in a GET of my
> error page.  This significantly slows down the link check process.  Is
> this something I've done?  When checked manually, all the links on the
> particular page in the example above work, but the resulting link check
> gets the error page.

That's sort of a glitch in the link checker's results UI.  Before
checking the actual link, the link checker fetches /robots.txt to see if
it's allowed to do that, which yields (on my local instance, but it's
practically the same on validator.w3.org):

| GET /robots.txt HTTP/1.1
| [...]
|
| HTTP/1.1 302 Found
| Date: Mon, 27 Sep 2004 17:40:59 GMT
| Server: Apache/1.3.29 (Unix)
| Location: http://www.qohs.org/error/index.php
| [...]

As currently implemented, the intention is that nothing resulting from
fetching /robots.txt shows up in the link checker's results, but this
one "leaks" through.  Dunno whether that's actually harmful, though :)
(But it is certainly somewhat misleading.)
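To illustrate the check the link checker performs before each fetch, here is a sketch in Python using the standard library's robots.txt parser (the real checker does this in Perl via libwww-perl; the rules below are hypothetical, not www.qohs.org's actual robots.txt, which doesn't exist):

```python
from urllib import robotparser

# Parse a hypothetical robots.txt instead of fetching one over the network.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# A well-behaved crawler asks this question before every fetch:
print(rp.can_fetch("W3C-checklink", "http://www.qohs.org/qowebsite/valid/"))   # True
print(rp.can_fetch("W3C-checklink", "http://www.qohs.org/private/page.html"))  # False
```

In your case the /robots.txt request itself gets redirected to the error page, which is why the error page shows up in the results at all.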

Adding a /robots.txt to the root of www.qohs.org would cause the link
checker not to fetch the error page (and ditto for all other web
crawlers that respect robots exclusion rules, BTW).  See
http://www.robotstxt.org/ for more information.
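For example, a minimal /robots.txt that disallows nothing (so crawlers, including the link checker, may fetch everything, and the request no longer redirects to the error page) would be just:

```
User-agent: *
Disallow:
```

An empty Disallow value means no URL is excluded.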

On a side note (mainly to self), the "302 Found" redirect for
/robots.txt (pointing at non-robots.txt content) seems to trigger a
bug in the underlying libwww-perl library: it causes the link checker
to try to retrieve /robots.txt for every link on that server.  It is
supposed to fetch it only once per server during a single link check
run (due to some issues in the current implementation it may sometimes
do so more than once, but certainly not for every link).
Received on Monday, 27 September 2004 17:58:58 GMT
