- From: Henrik Frystyk Nielsen <frystyk@w3.org>
- Date: Mon, 04 Dec 1995 10:44:53 -0500
- To: "Giovanni Vigna" <vigna@ipmel2.elet.polimi.it>
- Cc: www-lib@w3.org
Sorry for the delay - I have been busy putting the official 4.0 release
together. It is now released to the W3C members and it will go into the
public domain by the end of the year. However, a prerelease is already
available from our FTP server. It lacks some finishing off compared to
the 4.0 version, but the structure and the functionality are pretty much
the same. Furthermore, you can have a look at the two new applications:

A simple command line tool:

    http://www.w3.org/pub/WWW/ComLine/

And a framework for a non-forking, portable server:

    http://www.w3.org/pub/WWW/MiniServ/

> I am writing an application using your (great) WWW library.
> The application should recursively retrieve documents referenced from
> a starting URL under some constraints in order to avoid downloading
> the entire Web :).

Hmmm - I can see that this might be a problem!

> I read carefully the Library documentation, and examined your LineMode
> textual browser in order to understand how to implement my tool.

This is one of the parts that I didn't get around to writing.

> Unfortunately, due to MY limits I was not able to understand how to
> retrieve all the subparts of a document once I got the starting URL
> anchor.
>
> Question: how can I retrieve all the document-related parts, including
> included images, icons used as bullets in dotted lists, dingbats of
> TITLE tags etc.? How can I retrieve referenced documents e.g.
> HREF="http://www.test.com/page.html"?

This functionality is more or less in the Line Mode Browser, but of
course it doesn't get the additional URIs. I am thinking of the "list
references" functionality that you get if you type

    ./www -listref http://www.w3.org

This gives you a list of the URLs in this document. The secret is in the
Anchor object. The Anchor object contains a "small map", or a subpart of
the Web, that the application has been in touch with.
Normally this anchor map stays around as long as the application and it
just continues to grow, making new connections between the anchor
objects - just like the Web itself.

> Can I retrieve this data starting from the anchor (maybe using child
> anchors and links) or I must parse the SGML tags?

Yes - you do need to parse the HTML documents - otherwise you will not
find the child anchors. There are two types of anchor objects: parent
anchors and child anchors.

A parent anchor represents a link to a document - just like the HREF
attribute in an HTML anchor. You can download a parent anchor, and it
contains all the information about the document, such as content type,
content length, expiration date etc.

Child anchors represent parts of an anchor - they are much like the NAME
attribute in an HTML link. Normally they are identified by the NAME
value, but if the anchor has no NAME value then a child anchor is
created anyway. You can get a better idea of what is happening if you
run the Line Mode Browser with the -va option (show anchor trace):

    ./www -va http://www.w3.org

You cannot download a child anchor, as it has no means of storing the
document or the metainformation about it. However, all anchors have a
link to their parent anchor. If an anchor already _is_ a parent anchor
then it points to itself, but if it is a child anchor then it points to
the parent.

In addition to having a link to their parent, anchors also have a link
to their destination, that is, the document that they represent and that
you want to get when you activate the link (for example by clicking on
it). A child anchor can point to another anchor, so what normally
happens is that when you create a child anchor you also create a parent
anchor representing the destination of the link.

The libwww anchors can actually have multiple destinations, which is a
property used when you want to POST data to multiple destinations at
once.
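
As a rough illustration of the parent/child relationship described above, here is a toy model in C. It does not use libwww at all: ToyAnchor, toy_parent_new(), toy_child_new() and toy_list_refs() are names invented for this sketch and are not claimed to match the real HTAnchor structures. It only shows the idea that a parent anchor points to itself, a child anchor points to its parent, and the references of a document are found by following each child's main destination back to a parent anchor. Like the anchor map in the Library, the toy anchors are never freed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy model only: these are NOT libwww's anchor structures. */
typedef struct ToyAnchor ToyAnchor;

struct ToyAnchor {
    const char *address;    /* URL of a parent anchor, NULL for a child  */
    ToyAnchor  *parent;     /* a parent anchor points to itself          */
    ToyAnchor  *main_dest;  /* main link destination, or NULL            */
    ToyAnchor  *children;   /* parent only: head of its child list       */
    ToyAnchor  *next;       /* child only: next sibling in that list     */
};

/* Create a parent anchor, i.e. something that could be downloaded. */
ToyAnchor *toy_parent_new(const char *address)
{
    ToyAnchor *a = calloc(1, sizeof *a);
    if (a == NULL)
        return NULL;
    a->address = address;
    a->parent = a;
    return a;
}

/* Create a child anchor inside `parent` whose main link leads to `dest`
   (`dest` may be NULL for a pure NAME-style anchor). */
ToyAnchor *toy_child_new(ToyAnchor *parent, ToyAnchor *dest)
{
    ToyAnchor *c;

    assert(parent != NULL && parent->parent == parent);
    c = calloc(1, sizeof *c);
    if (c == NULL)
        return NULL;
    c->parent = parent;
    c->main_dest = dest;
    c->next = parent->children;   /* push onto the front of the list */
    parent->children = c;
    return c;
}

/* Collect the addresses that the children of `doc` link to.
   Returns the number of addresses written to `out` (at most `max`). */
size_t toy_list_refs(const ToyAnchor *doc, const char **out, size_t max)
{
    const ToyAnchor *c;
    size_t n = 0;

    for (c = doc->children; c != NULL && n < max; c = c->next) {
        if (c->main_dest != NULL) {
            /* The destination may itself be a child anchor, so go
               through its parent pointer to reach the address. */
            out[n++] = c->main_dest->parent->address;
        }
    }
    return n;
}
```

To try it, create one parent anchor per document with toy_parent_new(), add a child with toy_child_new() for every link found while parsing, and then call toy_list_refs() on the starting document to get the addresses to fetch next.
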
Posting to multiple destinations is explained in more detail in the
Library Architecture document:

    http://www.w3.org/pub/WWW/Library/User/Architecture/PostWeb.html

However, in your example, you normally only want to follow the main
link, which can be done using the function HTAnchor_followMainLink().

> Question: once I got the anchor of a text/html document, how can I
> access the actual HTML data?

The way to do this is by getting the list of children for a parent
anchor - oops - we need a method here, but for now you must take it
directly from the anchor structure and then close your eyes ;-)

    HTList * children = anchor->children

and then traverse the list looking for the main destinations.

If your application has presentation capabilities then you can use the
HText interface directly. You can get an idea by looking into the
GridText.c module of the Line Mode Browser.

> Thanks VERY MUCH for your time and attention. I think you are doing a
> GREAT work! I know that version 4 of the Library is "in fieri", and I
> am looking forward to your December release!

Great! As I said, the current prerelease is already available and there
are two new example applications as well.

-- 
Henrik Frystyk Nielsen, <frystyk@w3.org>
World-Wide Web Consortium, MIT/LCS NE43-356
545 Technology Square, Cambridge MA 02139, USA
Received on Monday, 4 December 1995 10:51:20 UTC