RE: libwww and avoiding download of binary/unknown files from Adam Mlodzinski on 2006-09-07 (www-lib@w3.org from July to September 2006)

From: Adam Mlodzinski <Adam.Mlodzinski@quest.com>
Date: Thu, 7 Sep 2006 19:17:06 -0400
To: "Silvan Calarco" <silvan.calarco@mambasoft.it>, <www-lib@w3.org>
Message-ID: <46C5DCD410224A42BE3151AC8AE75E4E090405F9@tormbxw01.prod.quest.corp>

> -----Original Message-----
> From: www-lib-request@w3.org [mailto:www-lib-request@w3.org] 
> On Behalf Of Silvan Calarco
> Sent: Monday, September 04, 2006 6:55 AM
> To: www-lib@w3.org
> Subject: libwww and avoiding download of binary/unknown files
> 
> 
> Hi.
> I'm writing my first app based on libwww, it aims to do 
> something similar to webbot but I'm facing a problem that I 
> can't solve because of my limited knowledge of the libwww 
> architecture. 
> When a web site is scanned recursively using anchors and 
> requests


How do you accomplish the recursive scanning? Is this a feature of
libwww, or have you written your own code to do this?



> all the files are downloaded including binary files. 


This is always tricky. What is a binary file? Is an image file binary?
Probably, if it's a GIF or PNG, but what about an SVG file? Okay, easy
enough. But what about a PDF file, or files with no extension at all?
Everyone has their own idea of what makes a file binary rather than
text/ASCII.


> For these save file name is prompted to the user (my app and 
> webbot behave in the same manner), but I don't want binary 
> files to be downloaded at all. If I define the following 
> callback user is not prompted anymore but file is transferred 
> from network to the black hole thus generating unuseful traffic:
> 
> HTMIME_setSaveStream(HTBlackHoleConverter);
> 
> So my question is, can I detect the content type of a file 
> (presumably letting libwww read just a part of it) and then 
> decide not to download it? How?

You have two options. Either use a HEAD request instead of a GET for
each file during the recursive scan (although I don't think all servers
support HEAD requests properly), then decide whether you want the file
based on its MIME type (probably by setting up a filter to do that); or
decide whether you want the file based solely on the file name and/or
extension (essentially what MIME does, only instead of asking the
server, you decide for yourself).
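
As a rough sketch of the two decisions above, in plain C with no libwww
calls: want_by_mime() would be applied to the Content-Type value that
comes back from the HEAD request, and want_by_extension() to the URL
before any request is made. Both function names, the list of accepted
types, and the list of accepted extensions are made up for illustration;
adjust them to whatever you consider "not binary". Note that this sketch
treats URLs with no extension as unknown and rejects them.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive test: does s start with prefix? */
static int starts_with_ci(const char *s, const char *prefix)
{
    while (*prefix) {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return 0;
        s++;
        prefix++;
    }
    return 1;
}

/* Option 1: decide from a Content-Type value such as
 * "text/html; charset=utf-8". Returns 1 to fetch, 0 to skip.
 * The accepted types are an assumption, not a standard list. */
int want_by_mime(const char *content_type)
{
    if (content_type == NULL)
        return 0;
    while (*content_type == ' ' || *content_type == '\t')
        content_type++;
    return starts_with_ci(content_type, "text/")
        || starts_with_ci(content_type, "application/xhtml+xml");
}

/* Option 2: decide from the URL alone. Returns 1 to fetch, 0 to skip.
 * The extension list is an assumption. */
int want_by_extension(const char *url)
{
    static const char *const wanted[] = { ".html", ".htm", ".txt", ".xml" };
    const char *dot = NULL;
    size_t len, ext_len, i;

    if (url == NULL)
        return 0;
    len = strcspn(url, "?#");          /* ignore query string and fragment */
    for (i = 0; i < len; i++) {        /* last '.' after the last '/' */
        if (url[i] == '/')
            dot = NULL;
        else if (url[i] == '.')
            dot = url + i;
    }
    if (dot == NULL)
        return 0;                      /* no extension: treated as unknown */
    ext_len = (size_t)(url + len - dot);
    for (i = 0; i < sizeof wanted / sizeof wanted[0]; i++) {
        if (strlen(wanted[i]) == ext_len && starts_with_ci(dot, wanted[i]))
            return 1;
    }
    return 0;
}
```

The extension check costs nothing on the network but can be fooled by
the file name; the MIME check costs one extra round trip per URL and
depends on the server reporting the type honestly.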

Keep in mind that file extensions don't always give away the file
contents - it's a nice convention used 99.9% of the time, but there's
nothing preventing anyone from naming a file, ASCII or binary, with any
extension they feel like. I know of (vaguely) a Perl script that can
tell you if a file is ASCII or binary by reading the first few bytes of
the file - but that requires the file to be present, an option you don't
have in your case.
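
For what it's worth, the "first few bytes" idea looks roughly like the
following in C. This is not that Perl script, just a guess at the usual
heuristic: a NUL byte means binary, and so does a high share of control
characters. The function name looks_binary() and the 10% threshold are
my own assumptions. It only helps once you already have the start of the
file in a buffer.

```c
#include <stddef.h>

/* Heuristic: returns 1 if the first n bytes look binary, 0 if they
 * look like text. An empty buffer is reported as text. */
int looks_binary(const unsigned char *buf, size_t n)
{
    size_t i, suspicious = 0;

    for (i = 0; i < n; i++) {
        unsigned char c = buf[i];
        if (c == 0)
            return 1;                  /* NUL almost never occurs in text */
        if (c < 0x20 && c != '\t' && c != '\n' && c != '\r' && c != '\f')
            suspicious++;              /* control byte other than whitespace */
    }
    return suspicious * 10 > n;        /* more than 10% control bytes */
}
```

Bytes of 0x80 and above are deliberately not counted, so UTF-8 or
Latin-1 text is not flagged as binary.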



--
Adam Mlodzinski

Received on Thursday, 7 September 2006 23:17:32 UTC