
Robot crash, another voice!

From: Frank Wood <fwood@tofish.net>
Date: Fri, 3 Sep 1999 18:39:06 -0400
Message-ID: <01BEF63B.9AB96FE0.fwood@tofish.net>
To: "'www-lib@w3c.org'" <www-lib@w3c.org>
Hello All,

    I'm evaluating libwww for use in a crawler we're developing so I 
thought I'd check out what's already been written (webbot).  I'm not too 
happy with what I've found thus far and am seeking some help.

I'm trying to run the latest CVS version of webbot.exe compiled under MSVC 
5.0 on NT 4.0 sp4 but I have been unsuccessful in getting it to run for any 
significant amount of time at all.  (<5 minutes)  My machine never comes 
close to hitting out of memory conditions and it happens on every site I've 
tried.

command line:

webbot http://"bigasssite"/ -depth 10 -norobots -prefix http -bfs

I understand that this is resource intensive and could piss off the owner 
of said "bigasssite" if they closely examined their logs; however, all I 
want to see is that the crawler will run until my machine runs out of 
resources (memory in particular).  To my chagrin, I have not been able to 
run it without it crashing at:

HTTChunk.c:55.  The line it reports (the %s argument) doesn't print 
particularly well in the DOS shell, but in one case it's "gn="middle">"

So, what's the scoop?  My bet is that an edge byte or two is misinterpreted 
during de-chunking and the decoder gets fooled into thinking it's looking at 
a header instead of document body.  Either that or the server barfed up 
something badly formatted and the de-chunking process failed ungracefully.
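
As a rough illustration of the second possibility, a decoder can refuse a 
malformed chunk-size line instead of crashing on it.  The sketch below is 
not the code in HTTChunk.c; parse_chunk_size is a hypothetical helper 
written only for this example, and it assumes the HTTP/1.1 chunk header 
format of hex digits, an optional ";extension", then CRLF.

```c
#include <ctype.h>
#include <limits.h>

/* Hypothetical helper, NOT part of libwww: parse the size line of an
 * HTTP/1.1 chunk ("<hex-size>[;extensions]\r\n").
 * Returns 0 and stores the size on success, -1 if the line is malformed. */
static int parse_chunk_size(const char *line, unsigned long *size)
{
    unsigned long value = 0;
    int digits = 0;

    /* Accumulate leading hex digits, guarding against overflow. */
    while (isxdigit((unsigned char)*line)) {
        int c = tolower((unsigned char)*line);
        unsigned long d = (unsigned long)(isdigit(c) ? c - '0' : c - 'a' + 10);
        if (value > (ULONG_MAX - d) / 16)
            return -1;
        value = value * 16 + d;
        digits++;
        line++;
    }

    /* A chunk header must start with at least one hex digit. */
    if (digits == 0)
        return -1;

    /* Tolerate trailing blanks, then require ';', CR, LF, or end of string. */
    while (*line == ' ' || *line == '\t')
        line++;
    if (*line != '\0' && *line != ';' && *line != '\r' && *line != '\n')
        return -1;

    *size = value;
    return 0;
}
```

With a check like this, a stray piece of document body such as 
gn="middle"> is rejected with -1 (it has no leading hex digit), so the 
caller can abort the request cleanly rather than treat it as a header.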

Fixing this little buggy would make me very, very happy.  Let me know if I 
can help.

Also, in the process of trying to get around this I discovered that 
revision 2.5 of HTCookie.c doesn't work with the current Win32 Makefiles. 
 I didn't fix it, just rolled that rev. back to 2.4, and everything, with 
the exception of the aforementioned problems, compiles fine.

Thanks,

Frank Wood                   (mailto:fwood@tofish.net)
ToFish! Incorporated    (http://www.tofish.net/)
2121 K St.NW Suite 800
Washington, DC 20037
PH (202) 261-3591
FAX (202) 261-3592
Received on Friday, 3 September 1999 18:30:26 UTC
