- From: Adam Mlodzinski <Adam.Mlodzinski@quest.com>
- Date: Thu, 7 Sep 2006 19:17:06 -0400
- To: "Silvan Calarco" <silvan.calarco@mambasoft.it>, <www-lib@w3.org>
> -----Original Message-----
> From: www-lib-request@w3.org [mailto:www-lib-request@w3.org]
> On Behalf Of Silvan Calarco
> Sent: Monday, September 04, 2006 6:55 AM
> To: www-lib@w3.org
> Subject: libwww and avoiding download of binary/unknown files
>
> Hi.
> I'm writing my first app based on libwww, it aims to do
> something similar to webbot but I'm facing a problem that I
> can't solve because of my limited knowledge of the libwww
> architecture.
> When a web site is scanned recursively using anchors and
> requests

How do you accomplish the recursive scanning? Is this a feature of libwww, or have you written your own code to do this?

> all the files are downloaded including binary files.

This is always tricky. What is a binary file? Is an image file binary? Probably, if it's a GIF or PNG, but what about an SVG file? Okay, easy enough. But what about a PDF file, or files with no extension at all? Everyone has their own idea of what makes a file binary rather than text/ASCII.

> For these save file name is prompted to the user (my app and
> webbot behave in the same manner), but I don't want binary
> files to be downloaded at all. If I define the following
> callback user is not prompted anymore but file is transferred
> from network to the black hole thus generating unuseful traffic:
>
> HTMIME_setSaveStream(HTBlackHoleConverter);
>
> So my question is, can I detect the content type of a file
> (presumably letting libwww read just a part of it) and then
> decide not to download it?How?

You have two options. One is to issue a HEAD request instead of a GET for each file during the recursive scan, then decide whether you want the file based on its MIME type (probably by setting up a filter to do that), although I don't think all servers support HEAD requests properly. The other is to decide based solely on the file name and/or extension (essentially what MIME does, only instead of asking the server, you decide for yourself).
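
As a rough illustration of the decision step in either option, here is a minimal C sketch. It does not call libwww at all: the function names (wanted_content_type, wanted_extension) and the lists of accepted types and extensions are hypothetical, and getting the Content-Type value out of a HEAD response, or the path out of an anchor, is left to the surrounding code.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive test: returns 1 if s starts with prefix, else 0. */
static int has_prefix_ci(const char *s, const char *prefix)
{
    while (*prefix) {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return 0;
        s++;
        prefix++;
    }
    return 1;
}

/*
 * Option 1: decide from the Content-Type value of a HEAD response.
 * Returns 1 to fetch, 0 to skip. The accepted list is a hypothetical
 * policy: any text subtype plus a couple of XML application types.
 */
int wanted_content_type(const char *content_type)
{
    static const char *const wanted[] = {
        "text/", "application/xhtml+xml", "application/xml"
    };
    size_t i;

    if (content_type == NULL)
        return 0;
    while (*content_type == ' ' || *content_type == '\t')
        content_type++;
    /* Prefix match, so parameters such as "; charset=..." are ignored. */
    for (i = 0; i < sizeof wanted / sizeof wanted[0]; i++)
        if (has_prefix_ci(content_type, wanted[i]))
            return 1;
    return 0;
}

/*
 * Option 2: decide from the URL path alone (assumed to carry no query
 * string). Returns 1 to fetch, 0 to skip, and -1 when there is no
 * extension to judge by, so the caller can fall back to option 1.
 */
int wanted_extension(const char *path)
{
    static const char *const wanted[] = { ".html", ".htm", ".txt", ".xml" };
    const char *slash, *name, *dot;
    size_t i;

    if (path == NULL)
        return -1;
    slash = strrchr(path, '/');
    name = slash ? slash + 1 : path;
    dot = strrchr(name, '.');
    if (dot == NULL)
        return -1;
    for (i = 0; i < sizeof wanted / sizeof wanted[0]; i++)
        if (strlen(dot) == strlen(wanted[i]) && has_prefix_ci(dot, wanted[i]))
            return 1;
    return 0;
}
```

With these definitions, wanted_content_type("text/html; charset=utf-8") returns 1 and wanted_content_type("image/png") returns 0, while wanted_extension("/docs/README") returns -1 to signal that the name alone is not enough to decide.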

Keep in mind that file extensions don't always give away the file contents. It's a nice convention used 99.9% of the time, but there's nothing preventing anyone from naming a file, ASCII or binary, with any extension they feel like. I know (vaguely) of a Perl script that can tell you whether a file is ASCII or binary by reading the first few bytes of the file, but that requires the file to be present, an option you don't have in your case.

-- 
Adam Mlodzinski
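
For reference, the kind of first-bytes check mentioned above can be sketched in C. This is not the Perl script itself, only a common heuristic under stated assumptions: a NUL byte means binary, and otherwise the buffer is called binary when more than 30% of its bytes are unusual control characters. The function name looks_binary and the 30% threshold are assumptions, and bytes above 127 are deliberately not counted, since they may be UTF-8 or Latin-1 text.

```c
#include <stddef.h>

/*
 * Heuristic text/binary test on a block of bytes already in memory.
 * Returns 1 if the block looks binary, 0 if it looks like text.
 */
int looks_binary(const unsigned char *buf, size_t len)
{
    size_t i, odd = 0;

    if (len == 0)
        return 0;                      /* empty input: treat as text */
    for (i = 0; i < len; i++) {
        unsigned char c = buf[i];

        if (c == 0)
            return 1;                  /* NUL bytes do not occur in plain text */
        if (c == 127 ||
            (c < 32 && c != '\t' && c != '\n' && c != '\r' &&
             c != '\f' && c != '\b' && c != 27))
            odd++;                     /* control character other than common ones */
    }
    /* Binary when more than 30% of the bytes are odd (assumed threshold). */
    return odd * 10 > len * 3;
}
```

The check only needs the leading block of a file, so it applies to whatever bytes are already in hand; it says nothing about how to obtain them without transferring the file.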
Received on Thursday, 7 September 2006 23:17:32 UTC