- From: Pinnaka, Muralikrishna <PinnakaM5757@cl.uh.edu>
- Date: Fri, 7 Nov 2003 13:01:27 -0600
- To: "'www-validator@w3.org'" <www-validator@w3.org>
Dear Sir/Madam,

We are working on a capstone project to cache one million pages from US and Canadian universities. Our first goal is to download 800 to 1000 pages from each university's web site (before applying our filtering algorithm to the cache). When we run checklink.pl on www.mit.edu, we get about 450 processed links from the first three levels. When we check www.cl.uh.edu for the same three levels, we get only 54 documents; even at the fourth level we get no more than 60 to 70 documents.

Is there a way to also process URLs with missing trailing slashes and redirected URLs (that is, the final URL after all redirections, if there are several)? We do not want to treat those redirections as errors; we want to increase the number of links that get processed. We could not work out from the code how to skip these kinds of errors and process those links as well.

We also observed that when we process http://www.cl.uh.edu, many of the internal links carry host names different from cl.uh.edu (for example, http://www.uhcl.edu/admissions and http://nas.cl.uh.edu/). We would like to process all URLs linked from the main web site.

Could you suggest a solution to this problem? We really appreciate your help and your time.

Thanks in advance.

Sincerely,
Muralikrishna
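[Editorial note: below is a minimal sketch of the behaviour the message asks for, written in Python rather than checklink.pl's Perl. It fetches a URL, follows redirects (such as a missing trailing slash) to the final URL, and treats an allowlist of related hostnames as internal. The hostnames are taken from the message above; the function names and the allowlist approach are hypothetical illustrations, not part of checklink.pl.]

```python
# Sketch: follow redirects to the final URL, and only recurse into links
# whose final host is on an allowlist of related hostnames.
from urllib.parse import urlsplit
from urllib.request import urlopen

# Hosts that should all count as "the same site" for crawling purposes
# (taken from the message; extend as needed).
INTERNAL_HOSTS = {"www.cl.uh.edu", "www.uhcl.edu", "nas.cl.uh.edu"}

def fetch_final(url):
    """Fetch url, following redirects (e.g. one caused by a missing
    trailing slash), and return (final_url, body)."""
    with urlopen(url, timeout=30) as resp:
        # geturl() gives the URL after all redirections were followed.
        return resp.geturl(), resp.read()

def is_internal(url):
    """Treat a link as internal if its host is on the allowlist."""
    return urlsplit(url).hostname in INTERNAL_HOSTS

final_url, body = fetch_final("http://www.cl.uh.edu")
if is_internal(final_url):
    print("crawl:", final_url)   # redirect target is still on-site; recurse
else:
    print("skip:", final_url)    # redirected off-site; do not recurse
```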
Received on Friday, 7 November 2003 14:00:15 UTC