- From: John Punin <puninj@cs.rpi.edu>
- Date: Thu, 29 Jul 1999 13:41:50 -0400 (EDT)
- To: guy.ferran@ardentsoftware.fr (Guy Ferran)
- Cc: puninj@cs.rpi.edu (John Punin), www-lib@w3.org
> Hi John,
>
> Here is the command line I am using:
>
> run -vu -q -ss -n -depth 99 \
>   -exclude '/ArchiveBrowser/|/History/|/member/|/team/|\.gz$|\.tar$|\.tgz$|\.Z$|\.zip$|\.ZIP$|\.exe$|\.EXE$|\.ps$|\.doc$|\.pdf$|\.xplot$|\.java$|\.c$|\.h$|\.ppt$|\.gif$|\.GIF$|\.tiff$|\.png$|\.PNG$|\.jpeg$|\.jpg$|\.JPE$' \
>   -prefix http:// \
>   -l robot2-log-clf.txt \
>   -alt robot2-log-alt.txt \
>   -hit robot2-log-hit.txt \
>   -rellog robot2-log-link-relations.txt -relation stylesheet \
>   -lm robot2-log-lastmodified.txt \
>   -title robot2-log-title.txt \
>   -referer robot2-log-referer.txt \
>   -negotiated robot2-log-negotiated.txt \
>   -404 robot2-log-notfound.txt \
>   -reject robot2-log-reject.txt \
>   -format robot2-log-format.txt \
>   -charset robot2-log-charset.txt \
>   -cache \
>   -timeout 60 \
>   http://www.xmltree.com

Hi Guy,

You can run out of memory if webbot is "visiting" other web sites besides xmltree. I recommend the following:

1) Use -prefix http://www.xmltree.com/ so the robot stays within that site.
2) Give http://www.xmltree.com/ (with a trailing slash) as the initial URL.
3) Use the flag -redir.
4) Write a robots.txt that excludes the directories /ArchiveBrowser/, /History/, /member/ and /team/.

Best wishes,
John Punin
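[The fourth suggestion can be sketched as a robots.txt file served from the site root. This is only an illustration, not a file from the original thread; `User-agent: *` applies the rules to every robot that honors the robots exclusion protocol.]

```
# Sketch of a robots.txt implementing suggestion 4 above.
# Serve it as http://www.xmltree.com/robots.txt.
User-agent: *
Disallow: /ArchiveBrowser/
Disallow: /History/
Disallow: /member/
Disallow: /team/
```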
Received on Thursday, 29 July 1999 13:41:56 UTC