
parsing badly constructed html

From: John Kieti <jkieti@yahoo.com>
Date: Tue, 22 May 2001 10:12:35 -0700 (PDT)
Message-ID: <20010522171235.77351.qmail@web9405.mail.yahoo.com>
To: www-lib@w3.org

Hi,

I am trying to collect and parse HTML pages, mainly to
gather forward links, link text and document titles.

I am using a link callback function, a text callback
function and two element callback functions, as seen
below in my main parser function.
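
For reference, the callbacks themselves have roughly the following
shape (prototypes only, bodies trimmed out; the parameter lists are
what I make of HText.h, so treat them as approximate):

#include "WWWLib.h"
#include "WWWHTML.h"    /* HText registration interface */

/* Called for each anchor the HTML parser finds - this is where I
   pick up the forward links */
void foundlink (HText * text, int element_number, int attribute_number,
                HTChildAnchor * anchor, const BOOL * present,
                const char ** value)
{
    /* resolve the child anchor to an absolute address and store it */
}

/* Called with chunks of character data - used for link text and titles */
void foundtext (HText * text, const char * buffer, int length)
{
    /* append to the current link text or title if we are inside one */
}

/* Called when an element starts/ends - used to know when we are
   inside <A> or <TITLE> */
void beginElement (HText * text, int element_number,
                   const BOOL * present, const char ** value)
{
    /* remember that we entered the element */
}

void endElement (HText * text, int element_number)
{
    /* remember that we left it */
}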

Everything seems to work fine until a page that is not
properly constructed (HTML-wise) is reached. At that
point I get a segmentation fault. Is there any way I
could recover from bad documents so as to ignore them,
or does someone have a solution to the seg-fault
problem?

Below is my main parse function. (I am trying to avoid
sending a very big email full of code; will that be
necessary?)

bool RobotDoc::parse()
{
    /* _request is declared as a member of RobotDoc */
    _request = HTRequest_new();

    /* Register callback functions for extracting links, text and elements */
    HText_registerLinkCallback(foundlink);
    HText_registerTextCallback(foundtext);
    HText_registerElementCallback(beginElement, endElement);

    /* After filter used to stop the event loop */
    HTNet_addAfter(stoplinks, NULL, NULL, HT_ALL, HT_FILTER_LAST);

    /* Load the document (_url is a member of the class) */
    BOOL status = HTLoadAbsolute(_url, _request);

    /* Go into the event loop... */
    if (status == YES)
        HTEventList_loop(_request);

    HTRequest_setFlush(_request, YES);

    return status == YES;
}
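
For reference, the stoplinks after filter registered above just stops
the event loop; a rough sketch is below (not my exact code). The status
check is where I imagine a bad or failed document could be flagged and
ignored, if that is a sensible place to do it:

int stoplinks (HTRequest * request, HTResponse * response,
               void * param, int status)
{
    if (status != HT_LOADED) {
        /* the load failed or was interrupted: mark this document as
           bad and discard whatever the callbacks collected for it */
    }

    /* stop the event loop so that parse() can return */
    HTEventList_stopLoop();
    return HT_OK;
}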


Could someone please assist me?

Thanks,
Kieti

Received on Tuesday, 22 May 2001 13:12:42 GMT
