
Re: Recursive URL retriever




Sorry for the delay - I have been busy putting the official 4.0 release 
together. It is now released to the W3C members and it will go into the public 
domain by the end of the year. However, a prerelease is already available 
from our FTP server. It lacks some finishing touches compared to 
the 4.0 version, but the structure and the functionality are pretty much the 
same. Furthermore, you can have a look at the two new applications:

A simple command line tool:

	http://www.w3.org/pub/WWW/ComLine/

And a framework for a non-forking, portable server:

	http://www.w3.org/pub/WWW/MiniServ/

> 	I am writing an application using your (great) WWW library.
> The application should recursively retrieve documents referenced from a
> starting URL under some constraints in order to avoid downloading the
> entire Web :).

Hmmm - I can see that this might be a problem!

> I read carefully the Library documentation, and examined your LineMode
> textual browser in order to understand how to implement my tool.

This is one of the parts that I didn't get around to writing.
 
> Unfortunately, due to MY limits I was not able to understand how to retrieve
> all the subparts of a document once I got the starting URL anchor.
> 
> Question: how can I retrieve all the document-related parts, including
> included images, icons used as bullets in dotted lists, dingbats of TITLE
> tags etc.? How can I retrieve referenced documents e.g.
> HREF="http://www.test.com/page.html"?

The Line Mode Browser has something close to this, although it does not 
actually fetch the additional URIs. I am thinking of the "list references" 
functionality that you get if you type

	./www -listref http://www.w3.org

This gives you a list of URLs in this document.

The secret is in the Anchor object. The Anchor object contains a "small map" 
or a subpart of the web that the application has been in touch with. Normally 
this anchor map stays around as long as the application and it just continues 
to grow making new connections between the anchor objects - just like the Web 
itself.
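
To make the idea of a growing anchor map a bit more concrete, here is a minimal sketch in plain C. It is only a toy model: ToyAnchor, ToyMap and toy_intern() are hypothetical names made up for illustration and are not the libwww structures or functions. The point it shows is that asking for the same address twice gives back the same anchor object, so the map only grows when the application meets a new address.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAP_SIZE 64

/* Toy model only: hypothetical types, not the libwww anchor structures. */
typedef struct ToyAnchor {
    char url[128];
} ToyAnchor;

typedef struct ToyMap {
    ToyAnchor anchors[TOY_MAP_SIZE];
    size_t    used;                 /* number of anchors created so far */
} ToyMap;

/* Return the anchor for url, creating it the first time the address is seen.
   Returns NULL if the map is full or the URL does not fit. */
ToyAnchor *toy_intern(ToyMap *map, const char *url)
{
    size_t i;
    assert(map != NULL && url != NULL);
    for (i = 0; i < map->used; i++)
        if (strcmp(map->anchors[i].url, url) == 0)
            return &map->anchors[i];        /* already in the map */
    if (map->used == TOY_MAP_SIZE ||
        strlen(url) >= sizeof map->anchors[0].url)
        return NULL;
    strcpy(map->anchors[map->used].url, url);
    return &map->anchors[map->used++];      /* the map grows by one */
}
```

A zero-initialised ToyMap (for example a static one) is an empty map; every later call either finds an existing anchor or adds exactly one.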

> Can I retrieve this data starting from the anchor (maybe using child anchors and links) or I must parse the SGML tags?

Yes - you do need to parse the HTML documents - otherwise you will not find 
the child anchors. There are two types of anchor objects:  parent anchors and 
child anchors. A parent anchor represents a link to a document - just like we 
have the HREF attribute in a HTML anchor. You can download a parent anchor, 
and it contains all information such as content type, content length, 
expiration date etc.

Child anchors represent parts of an anchor - they are much like the NAME 
attribute in a HTML link. Normally they are identified by the NAME value, but 
if the anchor has no NAME value then a child anchor is created anyway. You can 
get a better idea of what is happening if you run the Line Mode Browser with 
the -va option (show anchor trace):

	./www -va http://www.w3.org

You cannot download a child anchor as it contains no means of storing the 
document or the metainformation about it. However, all anchors have a link to 
their parent anchor. If it already _is_ a parent anchor then it points to 
itself, but if it is a child anchor then it points to the parent. 
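
The parent link rule can be sketched in a few lines of C. Again this is a toy model under my own assumptions: the ToyAnchor type and the toy_ functions are hypothetical and only mirror the rule described above (a parent anchor points to itself, a child anchor points to its parent).

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: hypothetical names, not the libwww structures. */
typedef struct ToyAnchor {
    struct ToyAnchor *parent;   /* points to itself for a parent anchor */
    const char       *address;  /* URL for a parent, NAME value for a child
                                   (may be NULL if the child has no NAME) */
} ToyAnchor;

/* Set up a parent anchor: its parent link points to itself. */
void toy_init_parent(ToyAnchor *a, const char *url)
{
    assert(a != NULL);
    a->parent = a;
    a->address = url;
}

/* Set up a child anchor that belongs to an existing parent anchor. */
void toy_init_child(ToyAnchor *c, ToyAnchor *parent, const char *name)
{
    assert(c != NULL && parent != NULL && parent->parent == parent);
    c->parent = parent;
    c->address = name;
}

/* An anchor is a parent exactly when it points to itself. */
int toy_is_parent(const ToyAnchor *a)
{
    return a->parent == a;
}

/* Only parent anchors can be downloaded, so map any anchor to the
   anchor that can. */
ToyAnchor *toy_downloadable(ToyAnchor *a)
{
    return a->parent;
}
```

With this layout there is no need for a separate "kind" flag: toy_downloadable() works the same way on both kinds of anchor.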

In addition to having a link to their parent, anchors also have a link to 
their destination, that is, the document that they represent and that you want 
to get when you activate the link (for example by clicking on it). A child 
anchor can point to another anchor so what normally happens is that when you 
create a child anchor you also create a parent anchor representing the 
destination of the link.

The libwww anchors can actually have multiple destinations which is a property 
used when you want to POST data to multiple destinations at once. This is 
explained in more detail in the Library Architecture document:

	http://www.w3.org/pub/WWW/Library/User/Architecture/PostWeb.html

However, in your example, you normally only want to follow the main link which 
can be done using the function HTAnchor_followMainLink().
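
As a rough illustration of what following the main link means, here is a self-contained sketch. It does not call the Library: toy_follow_main_link() is a hypothetical stand-in for the role that HTAnchor_followMainLink() plays, and the ToyAnchor layout is an assumption made for the example. It goes from a child anchor to its main destination and then to the destination's parent anchor, which is the thing you can actually download.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: hypothetical names, not the libwww structures. */
typedef struct ToyAnchor {
    struct ToyAnchor *parent;     /* points to itself for a parent anchor */
    struct ToyAnchor *main_dest;  /* main link destination, or NULL */
    const char       *url;        /* set on parent anchors only */
} ToyAnchor;

/* Hypothetical stand-in for following the main link of an anchor. */
ToyAnchor *toy_follow_main_link(const ToyAnchor *a)
{
    return a != NULL ? a->main_dest : NULL;
}

/* URL of the document reached through the anchor's main link,
   or NULL if the anchor has no main link. */
const char *toy_target_url(const ToyAnchor *a)
{
    ToyAnchor *dest = toy_follow_main_link(a);
    if (dest == NULL)
        return NULL;
    assert(dest->parent != NULL);
    return dest->parent->url;     /* the destination's document */
}
```

An anchor without a main link (for example a pure NAME target) simply yields NULL, so a retriever can skip it.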

> Question: once I got the anchor of a text/html document, how can I access the actual HTML data?

The way to do this is by getting the list of children for a parent anchor - 
oops - we need a method here, but for now you must take it directly from the 
anchor structure and then close your eyes ;-)

	HTList * children = anchor->children;

and then traverse the list looking for the main destinations. If your 
application has presentation capabilities then you can use the HText interface 
directly. You can get an idea by looking into the GridText.c module of the 
Line Mode Browser.
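
The traversal itself, together with the constraints you need to avoid downloading the entire Web, could look roughly like the sketch below. It is a toy model in plain C and makes several assumptions of its own: ToyDoc, ToyChild and toy_walk() are hypothetical names, the children are kept in a small fixed array instead of an HTList, and the only constraints are a depth limit and a visited flag. A real tool would load and parse each document in the visit hook, since parsing is what creates the child anchors.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_CHILDREN 8

typedef struct ToyDoc ToyDoc;

/* A child anchor: one link found while parsing a document. */
typedef struct ToyChild {
    ToyDoc *main_dest;            /* destination document, or NULL */
} ToyChild;

/* A parent anchor: one document plus the links found in it. */
struct ToyDoc {
    const char *url;
    ToyChild    children[TOY_MAX_CHILDREN];
    size_t      n_children;
    int         visited;          /* set once the document was handled */
};

/* Hypothetical hook called once per document that is visited. */
typedef void (*ToyVisit)(ToyDoc *doc, void *ctx);

/* Depth-limited walk over the main destinations of all children.
   Returns the number of documents visited by this call. */
size_t toy_walk(ToyDoc *doc, int max_depth, ToyVisit visit, void *ctx)
{
    size_t count = 1, i;
    if (doc == NULL || doc->visited || max_depth < 0)
        return 0;                 /* nothing to do, seen before, or too deep */
    assert(doc->n_children <= TOY_MAX_CHILDREN);
    doc->visited = 1;             /* mark first so cycles terminate */
    if (visit != NULL)
        visit(doc, ctx);
    for (i = 0; i < doc->n_children; i++)
        count += toy_walk(doc->children[i].main_dest, max_depth - 1,
                          visit, ctx);
    return count;
}
```

The visited flag is what keeps the walk finite when two documents link to each other; other constraints (same host only, maximum number of documents) would be extra tests in the first if statement.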
 
> Thanks VERY MUCH for your time and attention. I think you are doing a GREAT
> work! I know that version 4 of the Library is "in fieri", and I am looking
> forward to your December release!

Great! As I said, the current prerelease is already available, and there are 
two new example applications as well.


-- 

Henrik Frystyk Nielsen, <frystyk@w3.org>
World-Wide Web Consortium, MIT/LCS NE43-356
545 Technology Square, Cambridge MA 02139, USA