- From: Silvan Calarco <silvan.calarco@mambasoft.it>
- Date: Fri, 8 Sep 2006 02:42:58 +0200
- To: "Adam Mlodzinski" <Adam.Mlodzinski@quest.com>
- Cc: www-lib@w3.org
At 01:17 on Friday, 8 September 2006, Adam Mlodzinski wrote:
> How do you accomplish the recursive scanning? Is this a feature of
> libwww, or have you written your own code to do this?

I have defined a link callback function that receives every link found on the top page I request. In that function I issue a request for each internal link, then wait for the event loop to end, at which point I have all the pages.

> > all the files are downloaded including binary files.
>
> This is always tricky. What is a binary file? Is an image file binary?
> Probably, if it's a GIF or PNG, but what about an SVG file? Okay, easy
> enough. But what about a PDF file, or files with no extension at all?
> Everyone has their own ideas of what makes a binary file binary, and not
> text/ASCII.

By default libwww prompts before saving any file it doesn't recognize or considers binary. That is enough for me for now, since libwww handles it, but maybe I need to understand better how it decides. I just want to get HTML pages and avoid downloading any other file.

> You have two options: use a HEAD request for each file during the
> recursive scan (although I don't think all servers support HEAD requests
> properly) instead of a GET - then decide whether you want the file
> based on its MIME type (probably set up a filter to do that); OR, decide
> if you want the file based solely on the file name and/or extension
> (essentially what MIME does, only instead of asking the server, you
> decide for yourself).
>
> Keep in mind that file extensions don't always give away the file
> contents - it's a nice convention used 99.9% of the time, but there's
> nothing preventing anyone from naming a file, ASCII or binary, with any
> extension they feel like. I know of (vaguely) a Perl script that can
> tell you if a file is ASCII or binary by reading the first few bytes of
> the file - but that requires the file to be present, an option you don't
> have in your case.
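The two options quoted above can be sketched as two small predicates. This is only an illustration in Python, not libwww code: the function names and the `WANTED_TYPES` set are hypothetical, and treating a URL ending in `/` as an HTML page is an assumption about how most servers behave.

```python
import mimetypes
from urllib.parse import urlparse

# Assumption: these are the media types the crawler wants to keep.
WANTED_TYPES = {"text/html", "application/xhtml+xml"}


def is_html_content_type(header_value):
    """Option 1: decide from the Content-Type value the server reports."""
    if not header_value:
        return False
    # Drop parameters such as "; charset=UTF-8" and normalise case.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type in WANTED_TYPES


def looks_like_html_by_name(url):
    """Option 2: guess from the URL path alone, without asking the server.

    Returns True or False when the extension is known, and None when the
    name gives no usable hint (no extension or an unknown one).
    """
    path = urlparse(url).path
    if path == "" or path.endswith("/"):
        # Assumption: directory-style URLs usually serve an index page.
        return True
    guessed, _encoding = mimetypes.guess_type(path)
    if guessed is None:
        return None
    return guessed in WANTED_TYPES
```

The `None` result of the second function marks exactly the cases the quoted text warns about, where the name alone cannot settle the question and the server would have to be asked.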

I suppose the HEAD request will read only the beginning of a non-HTML file and report either that the header is not recognized or that it is recognized with a MIME type. If I can do that, it's enough. I'll try to do what you suggest and let you know.

Thanks.

Bye,
Silvan

--
mambaSoft di Silvan Calarco - http://www.mambasoft.it
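For reference, a HEAD response carries only the status line and headers that a GET would return, with no body at all, so the MIME type comes from the server's Content-Type header rather than from any part of the file. A minimal sketch of that check follows, written in Python with `urllib` rather than libwww. The names `head_content_type` and `should_download` are hypothetical, as is the choice to skip a link whenever the HEAD request fails.

```python
import urllib.error
import urllib.request


def head_content_type(url, timeout=10.0):
    """Send a HEAD request and return the Content-Type value, or None."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.headers.get("Content-Type")
    except (urllib.error.URLError, ValueError, OSError):
        # Malformed URL, network failure, or a server that rejects HEAD.
        return None


def should_download(url, fetch=head_content_type):
    """Return True only when the server reports the resource as text/html.

    Assumption: a failed or unanswered HEAD request means "skip the link".
    """
    content_type = fetch(url)
    if content_type is None:
        return False
    # Ignore parameters such as "; charset=utf-8".
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/html"
```

In a crawler, the link callback would call `should_download(link)` before issuing the GET. Since not every server handles HEAD properly, falling back to a name-based guess when `head_content_type` returns `None` may be a reasonable alternative to skipping the link.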
Received on Friday, 8 September 2006 00:43:13 UTC