
parsing badly constructed html

From: John Kieti <jkieti@yahoo.com>
Date: Tue, 22 May 2001 10:12:35 -0700 (PDT)
Message-ID: <20010522171235.77351.qmail@web9405.mail.yahoo.com>
To: www-lib@w3.org

I am trying to collect and parse HTML pages, mainly to collect
forward links, link text and document titles.

I am using a link callback function, a text callback function and
two element callback functions, as seen below in my main parser
function.
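For context, the callback registration follows the pattern from the
libwww examples, roughly like the snippet below. This is only a
sketch: foundLink, addText, beginElement and endElement stand in for
my actual callbacks, and the exact callback typedefs and registration
calls should be checked against libwww's HText.h.

#include "WWWLib.h"
#include "WWWInit.h"
#include "WWWHTML.h"
#include "HText.h"

/* Placeholder callbacks -- the exact typedefs are in HText.h */
static void foundLink (HText * text, int element_number,
                       int attribute_number, HTChildAnchor * anchor,
                       const BOOL * present, const char ** value)
{
    /* collect the forward link, e.g. via HTAnchor_address() */
}

static void addText (HText * text, const char * buffer, int length)
{
    /* accumulate link text / document title here */
}

static void beginElement (HText * text, int element_number,
                          const BOOL * present, const char ** value)
{
    /* note which element we are inside (e.g. TITLE or A) */
}

static void endElement (HText * text, int element_number)
{
    /* leave the element */
}

static void registerCallbacks (void)
{
    HText_registerLinkCallback(foundLink);
    HText_registerTextCallback(addText);
    HText_registerElementCallback(beginElement, endElement);
}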

Everything seems to work fine until a page that is not properly
constructed (HTML-wise) is reached. At that point I get a
segmentation fault. Is there any way I could recover from such bad
documents and simply ignore them, or does someone have a solution to
the seg-fault?

Below is my main function. (I am trying to avoid sending a very big
email full of code; will that be necessary?)

bool RobotDoc::parse() {
    // _request is declared as a member of RobotDoc
    _request = HTRequest_new();

    // Register callback functions for extracting info
    // (link, text and element callback registration omitted here)

    // Register an after filter to stop the event loop
    HTNet_addAfter(stoplinks, NULL, NULL, HT_ALL, HT_FILTER_LAST);

    // Load the document (_url is a member of the class)
    BOOL status = HTLoadAbsolute(_url, _request);

    /* Go into the event loop... */
    if (status == YES) {
        HTRequest_setFlush(_request, YES);
        HTEventList_loop(_request);
    }

    return status == YES;
}
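In case it is relevant, stoplinks is the after filter I register to
stop the loop; it is essentially the standard libwww terminate
handler, something like:

static int stoplinks (HTRequest * request, HTResponse * response,
                      void * param, int status)
{
    /* Stop the event loop once this request has terminated */
    HTEventList_stopLoop();
    return HT_OK;
}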

Someone please assist me.
Thanks, Kieti

