Re: Recursive URL retriever
To: "Giovanni Vigna" <email@example.com>
Subject: Re: Recursive URL retriever
From: Henrik Frystyk Nielsen <firstname.lastname@example.org>
Date: Mon, 04 Dec 1995 10:44:53 -0500
From email@example.com Mon Dec 4 10:51:20 1995
Reply-To: Henrik Frystyk Nielsen <firstname.lastname@example.org>
X-Mailer: exmh version 1.6.2 7/18/95
Sorry for the delay - I have been busy putting the official 4.0 release
together. It is now released to the W3C members and it will go into public
domain by the end of the year. However, you can find a prerelease which is
already available from our FTP server. It lacks some finishing off compared to
the 4.0 version, but the structure and the functionality are pretty much the
same. Furthermore, you can have a look at the two new applications:
A simple command line tool:
And a framework for a non-forking, portable server:
> I am writing an application using your (great) WWW library.
> The application should recursively retrieve documents referenced from a
> starting URL under some constraints in order to avoid downloading
> the entire Web :).
Hmmm - I can see that this might be a problem!
> I read carefully the Library documentation, and examined your LineMode
> textual browser in order to understand how to implement my tool.
This is one of the parts that I didn't get around to writing.
> Unfortunately, due to MY limits I was not able to understand how to retrieve all the subparts of a document once I got the starting URL a
> Question: how can I retrieve all the document-related parts, including included images, icons used as bullets in dotted lists, dingbats of TITLE tags etc.? How can I retrieve referenced documents e.g.
This functionality kind of is in the Line Mode Browser but of course it
doesn't get the additional URIs. I am thinking of the "list references"
functionality that you get if you type
./www -listref http://www.w3.org
This gives you a list of URLs in this document.
The secret is in the Anchor object. The Anchor object contains a "small map"
or a subpart of the web that the application has been in touch with. Normally
this anchor map stays around as long as the application and it just continues
to grow, making new connections between the anchor objects - just like the Web.
> Can I retrieve this data starting from the anchor (maybe using child anchors and links) or I must parse the SGML tags?
Yes - you do need to parse the HTML documents - otherwise you will not find
the child anchors. There are two types of anchor objects: parent anchors and
child anchors. A parent anchor represents a link to a document - just like we
have the HREF attribute in an HTML anchor. You can download a parent anchor,
and it contains all information such as content type, content length,
expiration date etc.
Child anchors represent parts of an anchor - they are much like the NAME
attribute in an HTML link. Normally they are identified by the NAME value but
if the anchor has no NAME value then a child anchor is created anyway. You can
get a better idea of what is happening if you run the Line Mode Browser with
the -va option (show anchor trace):
./www -va http://www.w3.org
You cannot download a child anchor as it contains no means of storing the
document or the metainformation about it. However, all anchors have a link to
their parent anchor. If it already _is_ a parent anchor then it points to
itself, but if it is a child anchor then it points to the parent.
In addition to having a link to their parent, anchors also have a link to
their destination, that is, the document that they represent and that you want
to get when you activate the link (for example by clicking on it). A child
anchor can point to another anchor so what normally happens is that when you
create a child anchor you also create a parent anchor representing the
destination of the link.
The libwww anchors can actually have multiple destinations which is a property
used when you want to POST data to multiple destinations at once. This is
explained in more detail in the Library Architecture document:
However, in your example, you normally only want to follow the main link, which
can be done using the function HTAnchor_followMainLink().
> Question: once I got the anchor of a text/html document, how can I access the actual HTML data?
The way to do this is by getting the list of children for a parent anchor -
oops - we need a method here, but for now you must take it directly from the
anchor structure and then close your eyes ;-)
HTList * children = anchor->children;
and then traverse the list looking for the main destinations. If your
application has presentation capabilities then you can use the HText interface
directly. You can get an idea by looking into the GridText.c module of the
Line Mode Browser.
> Thanks VERY MUCH for your time and attention. I think you are doing a GREAT work! I know that version 4 of the Library is "in fieri", and
> I am looking forward to your December release!
Great! As I said the current prerelease is already available and there are two new example applications as well.
Henrik Frystyk Nielsen, <email@example.com>
World-Wide Web Consortium, MIT/LCS NE43-356
545 Technology Square, Cambridge MA 02139, USA