
Recursive URL retriever




Dear WWW Library developers,
	I am writing an application using your (great) WWW library.
The application should recursively retrieve the documents referenced from a starting URL, under some constraints, in order to avoid downloading the entire Web :).
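To make concrete what I am after, here is a rough sketch of the traversal I have in mind. This is plain Python, not Library code: the `fetch` and `extract_links` steps are placeholders for whatever the Library provides, and the constraints shown (same host, maximum link depth) are just examples of the limits I want to impose.

```python
from urllib.parse import urljoin, urlparse

def crawl(start_url, fetch, extract_links, max_depth=2):
    """Breadth-first retrieval starting from start_url, restricted to
    the starting host and to max_depth link hops, so that the crawl
    does not download the entire Web.

    fetch(url) -> document body        (placeholder for an HTTP GET)
    extract_links(body) -> iterable of URLs found in the document
    """
    start_host = urlparse(start_url).netloc
    seen = {start_url}
    frontier = [(start_url, 0)]   # (url, link depth) pairs to visit
    pages = {}                    # url -> retrieved body
    while frontier:
        url, depth = frontier.pop(0)
        body = fetch(url)
        pages[url] = body
        if depth == max_depth:
            continue              # constraint: do not follow further
        for link in extract_links(body):
            absolute = urljoin(url, link)   # resolve relative URLs
            if urlparse(absolute).netloc != start_host:
                continue          # constraint: stay on one host
            if absolute not in seen:
                seen.add(absolute)
                frontier.append((absolute, depth + 1))
    return pages
```

What I do not know is which Library calls play the roles of `fetch` and `extract_links` here, which is what my questions below are about.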

I read carefully the Library documentation, and examined your LineMode textual browser in order to understand how to implement my tool.

Unfortunately, due to my own limitations, I was not able to understand how to retrieve all the subparts of a document once I have the starting URL's anchor.

Question: how can I retrieve all the parts belonging to a document, including inline images, icons used as bullets in lists, dingbats of TITLE tags, etc.? And how can I retrieve referenced documents, e.g. HREF="http://www.test.com/page.html"?
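To illustrate the distinction I am drawing, here is a small sketch of what I would do if I parsed the markup myself (again, not Library code, and the tag/attribute choices are only examples): collect HREF targets, which are separate documents, separately from SRC targets, which are data needed to render the page itself.

```python
from html.parser import HTMLParser

class RefCollector(HTMLParser):
    """Collect outgoing references from an HTML document, split into
    hypertext links (<A HREF=...>) and inline data (<IMG SRC=...)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []    # referenced documents
        self.inline = []   # images/icons needed to render this page

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.hrefs.append(attrs["href"])
        elif tag == "img" and "src" in attrs:
            self.inline.append(attrs["src"])

collector = RefCollector()
collector.feed('<img src="bullet.gif">'
               '<a href="http://www.test.com/page.html">a page</a>')
```

My hope is that the Library's anchor objects already give me both kinds of reference, so that I do not need to write a parser like this at all.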

Can I retrieve this data starting from the anchor (perhaps using child anchors and links), or must I parse the SGML tags myself?

Question: once I have the anchor of a text/html document, how can I access the actual HTML data?

Question: what exactly do anchors and links mean in the Library's model? Could you please provide some simple examples?



Thanks VERY MUCH for your time and attention. I think you are doing GREAT work! I know that version 4 of the Library is "in fieri", and I am looking forward to your December release!

Many thanks again

Giovanni Vigna
vigna@elet.polimi.it