Re: recursive url retriever

On Mon, 2 Dec 2002, Emmanuel Saracco wrote:

> once again: could anybody send me a simple recursive uri load with depth
> control using libwww?

If you are looking for simple, why use libwww =;>  

  [bancroft@res:/usr/local/src/w3c-libwww-5.4.0/Robot/src]$ grep HTQueue *.c
  HTQueue.c:**	@(#) $Id: HTQueue.c,v 1.1 1998/10/26 22:45:34 frystyk Exp $
  HTQueue.c:#include "HTQueue.h"
  HTQueue.c:HTList * HTQueue_new(void)
  HTQueue.c:BOOL HTQueue_delete(HTList *me)
  HTQueue.c:BOOL HTQueue_enqueue(HTList *me,void *newObject)
  HTQueue.c:BOOL HTQueue_append(HTList *me,void *newObject)
  HTQueue.c:BOOL HTQueue_dequeue(HTList *me)
  HTQueue.c:BOOL HTQueue_isEmpty(HTList *me)
  HTQueue.c:void * HTQueue_headOfQueue(HTList *me)
  HTQueue.c:int HTQueue_count(HTList *me)
  HTRobot.c:#include "HTQueue.h"
  HTRobot.c:    me->queue = HTQueue_new();
  HTRobot.c:	if (mr->queue) HTQueue_delete(mr->queue);
  HTRobot.c:		HTQueue_append(mr->queue, (void *) nhd);
  HTRobot.c:	HTQueue_append(mr->queue, (void *)hd); (mr->cq)++;
  HTRobot.c:      if(!HTQueue_isEmpty(mr->queue))
  HTRobot.c:	  HyperDoc *nhd = (HyperDoc *)HTQueue_headOfQueue(mr->queue);
  HTRobot.c:	      HTQueue_dequeue(mr->queue); (mr->cq)--;
  HTRobot.c:		HTQueue_enqueue(mr->queue, (void *) nhd);

The basic idea is to queue URLs that match the pattern rather than following
links recursively.  This lets the robot separate the tasks of fetching URLs,
deciding whether to follow the links, and actually processing the queue (all
without a runaway stack).
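
For what it is worth, here is a minimal sketch of the queue-with-depth idea
in plain C.  It does not use libwww at all: the `links` table is a made-up
in-memory "web" standing in for real fetches, and `Item`, `Queue`,
`queue_push`, `queue_pop`, `queue_empty` and `crawl` are hypothetical names
for illustration, not anything taken from HTRobot.c.  Each queued entry
carries its own depth, so the depth limit becomes a simple comparison
instead of a recursion level.

```c
#include <assert.h>

#define NUM_PAGES 6
#define MAX_LINKS 3

/* Hypothetical link table: page i links to links[i][...]; -1 ends a row. */
static const int links[NUM_PAGES][MAX_LINKS] = {
    { 1,  2, -1},   /* page 0 */
    { 3, -1, -1},   /* page 1 */
    { 3,  4, -1},   /* page 2 */
    { 5, -1, -1},   /* page 3 */
    { 0, -1, -1},   /* page 4 links back to page 0 (a cycle) */
    {-1, -1, -1},   /* page 5 has no links */
};

/* A queued page together with its distance from the start page. */
typedef struct { int page; int depth; } Item;

/* Fixed-size FIFO; NUM_PAGES slots suffice because each page is queued once. */
typedef struct { Item items[NUM_PAGES]; int head, tail; } Queue;

static int queue_empty(const Queue *q) { return q->head == q->tail; }

static void queue_push(Queue *q, Item it)
{
    assert(q->tail < NUM_PAGES);
    q->items[q->tail++] = it;
}

static Item queue_pop(Queue *q)
{
    assert(!queue_empty(q));
    return q->items[q->head++];
}

/*
 * Visit every page reachable from `start` within `max_depth` links,
 * breadth first.  Page ids are written to `out` in visit order
 * (`out` must have room for NUM_PAGES ints); returns the number visited.
 */
int crawl(int start, int max_depth, int *out)
{
    Queue q = { .head = 0, .tail = 0 };
    int seen[NUM_PAGES] = {0};
    int n = 0;

    seen[start] = 1;
    queue_push(&q, (Item){ start, 0 });

    while (!queue_empty(&q)) {
        Item cur = queue_pop(&q);
        out[n++] = cur.page;              /* stand-in for fetching the page */

        if (cur.depth >= max_depth)
            continue;                     /* depth limit: do not follow links */

        for (int i = 0; i < MAX_LINKS && links[cur.page][i] >= 0; i++) {
            int next = links[cur.page][i];
            if (!seen[next]) {            /* skip pages already queued */
                seen[next] = 1;
                queue_push(&q, (Item){ next, cur.depth + 1 });
            }
        }
    }
    return n;
}
```

Calling crawl(0, 1, out) on this table visits pages 0, 1 and 2 and stops
there; raising the limit to 2 adds pages 3 and 4.  The `seen` array is what
keeps the cycle from page 4 back to page 0 from looping forever.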

Perhaps if you turn the HT_MYSQL definition on it will be easier to follow . . .

more,
l8r,

------------------------------------------------------------------- 
Victor Bancroft, Principal Engineer, Zvolve Systems [v]770.551.4505 
1050 Crown Pointe Pkwy, Suite 300, Atlanta GA 30338 [f]770.551.4509 
Fellow, Artificial Intelligence Center              [v]706.542-0358 
Athens, Georgia  30602, U.S.A           http://ai.uga.edu/~bancroft 

Received on Monday, 2 December 2002 10:54:10 UTC