w3c::checklink -> Need a suggestion

Dear Sir/Madam,

We are working on a capstone project to cache 1 million pages from US and
Canadian universities. Our first aim is to download 800 to 1000 pages from
each university web site (before we apply the filtering algorithm to the
cache). When we run checklink.pl on www.mit.edu, we get 450 links (links
that were processed) from the first 3 levels. When we check www.cl.uh.edu
for the same three levels, however, we get only 54 documents. Even at 4
levels we do not get more than 60 to 70 documents.

Is there any way to also process URLs with missing trailing slashes and
redirected URLs (using the final URL after all redirections, if there are
several)? We do not want to treat those redirections as errors; we want to
increase the number of links that get processed. We could not work out from
the code how to skip these kinds of errors and process those links as well.

We observed that when we process http://www.cl.uh.edu, all the internal
links contain different host strings (different from cl.uh.edu, such as
http://www.uhcl.edu/admissions and http://nas.cl.uh.edu/). We want to
process all the URLs that are linked from the main web site.
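
To make the request concrete, here is a minimal sketch in Python of the
behaviour we are hoping for. It is not checklink's code and does not use any
of its options. The host list, the function names and the redirect map are
all our own assumptions for illustration; in a real crawl the redirect
targets would come from the Location headers of the HTTP responses.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical set of hosts we would like treated as one "main web site".
IN_SCOPE_HOSTS = {"www.cl.uh.edu", "cl.uh.edu", "www.uhcl.edu", "nas.cl.uh.edu"}


def normalize(url):
    """Lower-case scheme and host, drop the fragment, give an empty path a '/'."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def in_scope(url, hosts=IN_SCOPE_HOSTS):
    """Return True if the URL's host is one of the hosts we want to process."""
    return (urlsplit(url).hostname or "") in hosts


def resolve(url, redirects):
    """Follow a {source: target} redirect map to the final URL.

    A redirect is treated as a pointer to the real document, not as an
    error. The 'seen' set stops the walk if the redirects form a loop.
    """
    seen = set()
    url = normalize(url)
    while url in redirects and url not in seen:
        seen.add(url)
        url = normalize(redirects[url])
    return url


# Assumed example: a redirect that only adds the missing trailing slash.
example_redirects = {
    "http://www.uhcl.edu/admissions": "http://www.uhcl.edu/admissions/",
}
final_url = resolve("http://www.uhcl.edu/admissions", example_redirects)
should_process = in_scope(final_url)
```

In this sketch a link would be queued for processing whenever the final URL
after all redirections is on one of the listed hosts, regardless of which
host the original link named.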

Could you suggest a solution to this problem? We would really appreciate
your help and your precious time.

Thanks in advance


Received on Friday, 7 November 2003 14:00:15 UTC