Re: webbot problem

From: Amir Michail <amir@cs.washington.edu>
Date: Mon, 24 Jan 2000 14:31:59 -0800
To: www-lib@w3.org
Message-Id: <00012414350400.00719@nishin.cs.washington.edu>
Hi,

As a follow-up on this problem, I should note that the URLs that fail with
error code 1 all come at the end of the crawl. Perhaps there is some Apache
limit (to prevent denial-of-service attacks)? Perhaps the name servers are
being overloaded? Any ideas?
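
One rough way to check where the failing entries fall in the log is sketched
below in Python. It assumes each line of the CLF log follows the usual Common
Log Format layout, with the status code as the second-to-last
whitespace-separated field. The function name failing_positions and the sample
lines are made up for illustration and are not actual webbot output.

```python
def failing_positions(lines, status="1"):
    """Return (positions, total): 0-based positions of entries whose
    status field equals `status`, and the number of entries seen."""
    positions = []
    total = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        # Assumption: status is the second-to-last field, bytes the last.
        if fields[-2] == status:
            positions.append(total)
        total += 1
    return positions, total


# Made-up sample lines in Common Log Format (not real log data).
sample = [
    'host - - [24/Jan/2000:14:00:00 -0800] "GET http://example.org/a" 200 1234',
    'host - - [24/Jan/2000:14:00:01 -0800] "GET http://example.org/b" 200 10',
    'host - - [24/Jan/2000:14:00:02 -0800] "GET http://example.org/c" 1 0',
]

positions, total = failing_positions(sample)
print(f"{len(positions)} of {total} entries have status 1, at positions {positions}")
```

To run it on a real log, pass an open file instead of the sample list, e.g.
failing_positions(open(path)) with path pointing at the CLF log. If the
positions cluster near the total, the failures are indeed at the end of the
crawl.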

Also, I had no luck running webbot on Solaris.  Most runs ended in a broken
pipe.

Amir

On Sun, 23 Jan 2000, Amir Michail wrote:
> Hi,
> 
> I tried to use webbot to crawl our dept. web site, but for some
> reason some URLs give error code 1 in the CLF log and are not followed.
> 
> These are correct URLs that work fine in any browser.
> 
> Moreover, if I start the crawl closer to the URL (e.g., not at the entrance
> of the site), there are no problems.
> 
> I tried increasing the crawl depth significantly and this seems to have made
> no difference.  I also have redirection turned on.
> 
> I am using version 5.2.9 compiled for Linux.
> 
> Amir
> 
> P.S.  Here is the command invocation:
> 
> ${ROBOT} ${FLAGS} -nopipe -redir -ss -n -depth 5 \
> -exclude "\?|/hype/|/mailing-list/|/archive/" \
> -check "\.wav$|\.dvi$|\.tex$|\.au$|\.eps$|\.mp3$|\.mov$|\.qt$|\.gz$|\.tar$|\.tgz$|\.Z$|\.zip$|\.ZIP$|\.exe$|\.EXE$|\.ps$|\.doc$|\.pdf$|\.xplot$|\.java$|\.c$|\.h$|\.txt$|\.ppt$|\.gif$|\.GIF$|\.tiff$|\.png$|\.PNG$|\.jpeg$|\.jpg$|\.JPE$" \
> -prefix ${ROOT} \
> -img -imgprefix ${IMGROOT} \
> -l ${LOG}-log-clf.txt \
> -referer ${LOG}-log-referer.txt \
> -404 ${LOG}-log-notfound.txt \
> -reject ${LOG}-log-reject.txt \
> ${ROOT}
Received on Monday, 24 January 2000 17:35:06 GMT