- From: John Punin <puninj@cs.rpi.edu>
- Date: Thu, 29 Jul 1999 13:41:50 -0400 (EDT)
- To: guy.ferran@ardentsoftware.fr (Guy Ferran)
- Cc: puninj@cs.rpi.edu (John Punin), www-lib@w3.org
> Hi John,
>
> Here is the command line I am using:
>
> run -vu -q -ss -n -depth 99 \
>   -exclude '/ArchiveBrowser/|/History/|/member/|/team/|\.gz$|\.tar$|\.tgz$|\.Z$|\.zip$|\.ZIP$|\.exe$|\.EXE$|\.ps$|\.doc$|\.pdf$|\.xplot$|\.java$|\.c$|\.h$|\.ppt$|\.gif$|\.GIF$|\.tiff$|\.png$|\.PNG$|\.jpeg$|\.jpg$|\.JPE$' \
>   -prefix http:// \
>   -l robot2-log-clf.txt \
>   -alt robot2-log-alt.txt \
>   -hit robot2-log-hit.txt \
>   -rellog robot2-log-link-relations.txt -relation stylesheet \
>   -lm robot2-log-lastmodified.txt \
>   -title robot2-log-title.txt \
>   -referer robot2-log-referer.txt \
>   -negotiated robot2-log-negotiated.txt \
>   -404 robot2-log-notfound.txt \
>   -reject robot2-log-reject.txt \
>   -format robot2-log-format.txt \
>   -charset robot2-log-charset.txt \
>   -cache \
>   -timeout 60 \
>   http://www.xmltree.com

Hi Guy,

You can run out of memory if webbot is "visiting" other web sites besides xmltree. I recommend the following:

1) Use -prefix http://www.xmltree.com/ so the robot stays within that site.
2) Give http://www.xmltree.com/ (with a trailing slash) as the initial URL.
3) Use the flag -redir.
4) Write a robots.txt that excludes the directories /ArchiveBrowser/, /History/, /member/ and /team/.

Best wishes,
John Punin
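[The fourth suggestion can be sketched as a robots.txt file served from the site root. This is only an illustration, not a file from the original thread; `User-agent: *` applies the rules to every robot that honors the robots exclusion protocol.]

```
# Sketch of a robots.txt implementing suggestion 4 above.
# Serve it as http://www.xmltree.com/robots.txt.
User-agent: *
Disallow: /ArchiveBrowser/
Disallow: /History/
Disallow: /member/
Disallow: /team/
```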
Received on Thursday, 29 July 1999 13:41:56 UTC