- From: Liam Quinn <liam@htmlhelp.com>
- Date: Fri, 7 Nov 2003 21:39:27 -0500 (EST)
- To: "Pinnaka, Muralikrishna " <PinnakaM5757@cl.uh.edu>
- Cc: "'www-validator@w3.org'" <www-validator@w3.org>
On Fri, 7 Nov 2003, Pinnaka, Muralikrishna wrote:

> We are working on a capstone project to cache 1 million pages from US and
> Canadian universities. Our first aim is to download 800 to 1000 pages from
> each university's web site (before we apply the filtering algorithm to the
> cache). When we use checklink.pl on www.mit.edu we get 450 links (links
> which got processed) from the first 3 levels.

When spidering a site, you really need to follow the Robots Exclusion
Protocol. Checklink unfortunately does not, so you should not use checklink
for your project unless you fix it to follow the Robots Exclusion Protocol.

I suggest using a tool such as "wget" instead. Wget allows you to span hosts
and specify which hosts to include or exclude.

-- 
Liam Quinn
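[For illustration only, not part of the original message: a wget invocation
along these lines would crawl a few levels of a site while limiting the crawl
to chosen hosts; the starting URL, depth, wait time, and domain lists below
are placeholders for your own settings.]

    # recursive fetch, 3 levels deep, with a polite 1-second delay;
    # --span-hosts plus --domains/--exclude-domains controls which
    # hosts are included or excluded (placeholder values shown)
    wget --recursive --level=3 --wait=1 \
         --span-hosts --domains=mit.edu \
         --exclude-domains=some.other.edu \
         http://www.mit.edu/

Unlike checklink, wget honours robots.txt by default when retrieving
recursively, which addresses the Robots Exclusion Protocol concern above.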
Received on Friday, 7 November 2003 21:37:54 UTC