- From: Liam Quinn <liam@htmlhelp.com>
- Date: Fri, 7 Nov 2003 21:39:27 -0500 (EST)
- To: "Pinnaka, Muralikrishna " <PinnakaM5757@cl.uh.edu>
- Cc: "'www-validator@w3.org'" <www-validator@w3.org>
On Fri, 7 Nov 2003, Pinnaka, Muralikrishna wrote:

> We are working on a capstone project to cache 1 million pages from US and
> Canadian universities. Our first aim is to download 800 to 1000 pages from
> each university's web site (before we apply the filtering algorithm to the
> cache). When we use checklink.pl on www.mit.edu we get 450 links (links
> which got processed) from the first 3 levels.

When spidering a site, you really need to follow the Robots Exclusion
Protocol. Checklink unfortunately does not, so you should not use checklink
for your project unless you fix it to follow the Robots Exclusion Protocol.

I suggest using a tool such as "wget" instead. Wget allows you to span hosts
and specify which hosts to include or exclude.

-- 
Liam Quinn
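[For illustration only, not part of the original message: a wget invocation
along these lines would crawl a few levels of a site while limiting the crawl
to chosen hosts; the starting URL, depth, wait time, and domain lists below
are placeholders for your own settings.]

    # recursive fetch, 3 levels deep, with a polite 1-second delay;
    # --span-hosts plus --domains/--exclude-domains controls which
    # hosts are included or excluded (placeholder values shown)
    wget --recursive --level=3 --wait=1 \
         --span-hosts --domains=mit.edu \
         --exclude-domains=some.other.edu \
         http://www.mit.edu/

Unlike checklink, wget honours robots.txt by default when retrieving
recursively, which addresses the Robots Exclusion Protocol concern above.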
Received on Friday, 7 November 2003 21:37:54 UTC