- From: Amir Michail <amir@cs.washington.edu>
- Date: Mon, 24 Jan 2000 14:31:59 -0800
- To: www-lib@w3.org
Hi,

As a follow-up on this problem, I should note that the URLs that fail with
error code 1 all come at the end of the crawl.

Perhaps there is some Apache limit (to prevent denial-of-service attacks)?
Perhaps the name servers are being overloaded?

Any ideas?

Also, I had no luck running webbot on Solaris. Most runs ended in a
broken pipe.

Amir

On Sun, 23 Jan 2000, Amir Michail wrote:

> Hi,
>
> I tried to use webbot to crawl our dept. web site, but for some
> reason some URLs give error code 1 in the CLF log and are not followed.
>
> These are correct URLs that work fine in any browser.
>
> Moreover, if I start the crawl closer to the URL (e.g., not at the entrance
> of the site), there are no problems.
>
> I tried increasing the crawl depth significantly and this seems to have made
> no difference. I also have redirection turned on.
>
> I am using version 5.2.9 compiled for Linux.
>
> Amir
>
> P.S. Here is the command invocation:
>
> ${ROBOT} ${FLAGS} -nopipe -redir -ss -n -depth 5 \
>   -exclude "\?|/hype/|/mailing-list/|/archive/" \
>   -check "\.wav$|\.dvi$|\.tex$|\.au$|\.eps$|\.mp3$|\.mov$|\.qt$|\.gz$|\.tar$|\.tgz$|\.Z$|\.zip$|\.ZIP$|\.exe$|\.EXE$|\.ps$|\.doc$|\.pdf$|\.xplot$|\.java$|\.c$|\.h$|\.txt$|\.ppt$|\.gif$|\.GIF$|\.tiff$|\.png$|\.PNG$|\.jpeg$|\.jpg$|\.JPE$" \
>   -prefix ${ROOT} \
>   -img -imgprefix ${IMGROOT} \
>   -l ${LOG}-log-clf.txt \
>   -referer ${LOG}-log-referer.txt \
>   -404 ${LOG}-log-notfound.txt \
>   -reject ${LOG}-log-reject.txt \
>   ${ROOT}
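
One way to check whether the failures really cluster at the end of the crawl is to look at where the error-code-1 entries fall in the CLF log. The following is only a rough sketch, not part of webbot: it assumes each log line follows the usual Common Log Format layout, with the status code as the second-to-last whitespace-separated field, and the function names `failed_positions` and `summarize` are made up for illustration.

```python
def failed_positions(lines, code="1"):
    """Return ([(line_number, raw_line), ...], total_entries) for one status code.

    Assumption: the status code is the second-to-last whitespace-separated
    field on each line, as in Common Log Format. Blank lines are ignored.
    """
    entries = [ln.rstrip("\n") for ln in lines if ln.strip()]
    hits = []
    for number, entry in enumerate(entries, start=1):
        fields = entry.split()
        if len(fields) >= 2 and fields[-2] == code:
            hits.append((number, entry))
    return hits, len(entries)


def summarize(lines, code="1"):
    """Report how many entries failed and how far into the log the first one is.

    first_failure_at is a fraction in (0, 1]; a value close to 1 means the
    failures only start near the end of the crawl.
    """
    hits, total = failed_positions(lines, code)
    if not hits:
        return None
    return {
        "failures": len(hits),
        "total": total,
        "first_failure_at": hits[0][0] / total,
    }
```

To use it, pass the lines of the file given to `-l`, for example `summarize(open(path))` where `path` is the CLF log written by the run. If `first_failure_at` is close to 1, the failures begin only late in the crawl, which would be consistent with a server-side limit or an overloaded name server rather than a problem with the URLs themselves.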
Received on Monday, 24 January 2000 17:35:06 UTC