Re: checklink re-checks links in when recursive option is on from Ville Skyttä on 2011-03-21 (www-validator@w3.org from March 2011)

From: Ville Skyttä <ville.skytta@iki.fi>
Date: Mon, 21 Mar 2011 23:59:28 +0200
To: www-validator@w3.org
Message-ID: <4D87CA40.2010001@iki.fi>

On 03/21/2011 05:33 PM, Chris Herdt wrote:
> When running the W3C link checker recursively, it re-checks links that
> it has already checked.

It should not be doing that for all links; are you sure it actually
does?  It does list all encountered links, but the ones it doesn't
actually retrieve again don't have a "HEAD $link ..." or "GET $link ..."
line after their "Checking link $link" line.  This should probably be
made clearer in the output.
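
To tell listed links apart from links that were actually retrieved, the
output can be tallied with a short script.  The following is a minimal
Python sketch, not part of the link checker itself.  It assumes the
output lines look exactly as described above ("Checking link $link",
optionally followed by "HEAD $link ..." or "GET $link ..."); the real
output may differ in detail, and the function name count_retrievals is
made up for this illustration.

```python
import re
from collections import Counter

# Assumed line shapes, taken from the description above; adjust if the
# real checklink output differs.
CHECKING_RE = re.compile(r"Checking link (\S+)")
REQUEST_RE = re.compile(r"(HEAD|GET) (\S+)")


def count_retrievals(log_text):
    """Return (listed, fetched) counters for a checklink-style log.

    listed[url]            -> how many times the link was listed
    fetched[(method, url)] -> how many times it was actually requested
    """
    listed = Counter()
    fetched = Counter()
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        match = CHECKING_RE.match(line)
        if match:
            listed[match.group(1)] += 1
            continue
        match = REQUEST_RE.match(line)
        if match:
            fetched[(match.group(1), match.group(2))] += 1
    return listed, fetched
```

A link that is listed many times but has at most one HEAD and one GET
entry in the second counter was not really re-checked.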

> E.g.:
> www.foo.com/page2.html links to www.foo.com/index.html
> www.foo.com/page3.html links to www.foo.com/index.html
> 
> Checking the link for www.foo.com/index.html a second (or third, or
> fourth, or one thousandth) time is time-consuming and seems
> unnecessary.

Some links do end up being checked first with a HEAD request as part of
the normal checking process, and then later retrieved a second time with a
GET request because at the time the HEAD request was made, it was not
known/realized that the document content was needed (document contents
are needed for the recursion functionality itself, as well as for
checking anchors).  But unless I've forgotten something, no link should
be retrieved more than twice (0 or 1 HEAD request, 0 or 1 GET request).
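
The "0 or 1 HEAD request, 0 or 1 GET request" behaviour described above
can be sketched with a small cache.  This is only an illustration in
Python under stated assumptions, not the link checker's actual code:
the LinkCache class, its check method, and the fetch callable are all
hypothetical names.

```python
class LinkCache:
    """Remember results so each URL gets at most one HEAD and one GET."""

    def __init__(self, fetch):
        # fetch is a hypothetical callable(method, url) -> result that
        # performs the real network request.
        self._fetch = fetch
        self._head_results = {}
        self._get_results = {}
        self.requests = []  # log of (method, url) actually sent

    def check(self, url, need_content=False):
        # A stored GET result answers every later question about the URL.
        if url in self._get_results:
            return self._get_results[url]
        if need_content:
            # Content is needed (recursion or anchor checking): one GET,
            # even if a HEAD was already sent earlier.
            self.requests.append(("GET", url))
            self._get_results[url] = self._fetch("GET", url)
            return self._get_results[url]
        if url not in self._head_results:
            # Plain existence check: one HEAD, reused afterwards.
            self.requests.append(("HEAD", url))
            self._head_results[url] = self._fetch("HEAD", url)
        return self._head_results[url]
```

With this arrangement a URL first seen as a plain link and later found
to need its content costs one HEAD plus one GET, while a URL whose
content is needed from the start costs a single GET.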

There could be some low-hanging optimization possibilities that would
avoid some of the extra HEAD requests.  Other than that, if you have a
URL and a recipe that can be used to demonstrate that the link checker
does not behave according to this description, please post it and I can
have a look.

Received on Monday, 21 March 2011 22:00:04 UTC