Re: libwww and avoiding download of binary/unknown files from Silvan Calarco on 2006-09-08 (www-lib@w3.org from July to September 2006)

From: Silvan Calarco <silvan.calarco@mambasoft.it>
Date: Fri, 8 Sep 2006 02:42:58 +0200
To: "Adam Mlodzinski" <Adam.Mlodzinski@quest.com>
Cc: www-lib@w3.org
Message-Id: <200609080242.58837.silvan.calarco@mambasoft.it>

On Friday, 8 September 2006 at 01:17, Adam Mlodzinski wrote:
> How do you accomplish the recursive scanning? Is this a feature of
> libwww, or have you written your own code to do this?

I have defined a link callback function that receives every link on the top
page I request. In that function I issue a request for each internal link
found, then I wait for the event loop to finish, at which point I have all
the pages.
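
To illustrate the link-callback step described above, here is a minimal
sketch in Python rather than libwww C code. It assumes that "internal" means
"same host as the top page"; the names LinkCollector and internal_links are
hypothetical and do not come from libwww.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Plays the role of the link callback: called once per start tag.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any #fragment.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def internal_links(page_url, html_text):
    """Return the de-duplicated links that stay on the page's own host."""
    collector = LinkCollector(page_url)
    collector.feed(html_text)
    host = urlsplit(page_url).netloc
    result = []
    for link in collector.links:
        # Assumption: a link is "internal" when its host matches the page's.
        if urlsplit(link).netloc == host and link not in result:
            result.append(link)
    return result
```

A crawler would request each URL returned by internal_links and feed every
fetched page back through the same function until no new links appear.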

> > all the files are downloaded including binary files.
>
> This is always tricky. What is a binary file? Is an image file binary?
> Probably, if it's a GIF or PNG, but what about an SVG file? Okay, easy
> enough. But what about a PDF file, or files with no extension at all?
> Everyone has their own ideas of what makes a binary file binary, and not
> text/ASCII.

By default libwww prompts to save every file it doesn't recognize or
considers binary. That is enough for me for now, since libwww already makes
the distinction, but maybe I need to understand better how it does so. I just
want to get HTML pages and avoid downloading any other file.

> You have two options: use a HEAD request for each file during the
> recursive scan (although I don't think all servers support HEAD requests
> properly) instead of a GET - then decide whether you want the file
> based on its MIME type (probably set up a filter to do that); OR, decide
> if you want the file based solely on the file name and/or extension
> (essentially what MIME does, only instead of asking the server, you
> decide for yourself).
>
> Keep in mind that file extensions don't always give away the file
> contents - it's a nice convention used 99.9% of the time, but there's
> nothing preventing anyone from naming a file, ASCII or binary, with any
> extension they feel like. I know of (vaguely) a Perl script that can
> tell you if a file is ASCII or binary by reading the first few bytes of
> the file - but that requires the file to be present, an option you don't
> have in your case.
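
The two local checks mentioned in the quoted text, deciding by file extension
and inspecting the first few bytes, can be sketched as follows. This is only
an illustration: the extension list, the choice to treat extensionless paths
as HTML candidates, and the 30% threshold are all assumptions, and the byte
check is not the Perl script referred to above.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical allow-list. The empty string covers paths such as "/docs/"
# or "/about", which are assumed here to be HTML candidates.
HTML_EXTENSIONS = {".html", ".htm", ".xhtml", ""}


def looks_like_html_url(url):
    """Guess from the URL path alone whether the resource is an HTML page."""
    path = urlsplit(url).path  # the query string and fragment are ignored
    extension = posixpath.splitext(path)[1].lower()
    return extension in HTML_EXTENSIONS


def looks_binary(first_bytes, threshold=0.3):
    """Heuristic: a NUL byte or many control bytes suggests binary data."""
    if not first_bytes:
        return False
    if b"\x00" in first_bytes:
        return True
    allowed_controls = {0x09, 0x0A, 0x0C, 0x0D}  # tab, LF, form feed, CR
    odd = sum(
        1
        for byte in first_bytes
        if (byte < 0x20 and byte not in allowed_controls) or byte == 0x7F
    )
    # Bytes >= 0x80 are counted as text, since they occur in UTF-8 and Latin-1.
    return odd / len(first_bytes) > threshold
```

The extension check needs no network traffic but trusts the file name; the
byte check is more reliable but needs at least the start of the body, so it
can only run once a download has begun.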

I suppose the HEAD request will read only the beginning of a non-HTML file
and report either that the header is not recognized or the MIME type it was
recognized as. If I can do that, it's enough. I'll try what you suggest and
let you know.
Thanks.
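
For reference, a HEAD response carries only the status line and headers and
no body, so the decision rests on the Content-Type header the server sends
rather than on the file's first bytes. A minimal sketch using Python's
standard http.client (not libwww) is shown here; the function names and the
set of media types accepted as HTML are assumptions.

```python
import http.client
from urllib.parse import urlsplit


def is_html_content_type(header_value):
    """True if a Content-Type header value names an HTML media type."""
    if not header_value:
        return False
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")


def head_is_html(url, timeout=10):
    """Send a HEAD request and report whether the server declares HTML."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        connection_class = http.client.HTTPSConnection
    else:
        connection_class = http.client.HTTPConnection
    connection = connection_class(parts.hostname, parts.port, timeout=timeout)
    try:
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
        connection.request("HEAD", target)
        response = connection.getresponse()
        # Redirects and errors are treated as "not HTML" in this sketch.
        return response.status == 200 and is_html_content_type(
            response.getheader("Content-Type")
        )
    finally:
        connection.close()
```

A crawler could call head_is_html on each internal link and issue the GET
only when it returns True. Servers that mishandle HEAD would need a fallback,
for example the extension check.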

Bye,
Silvan

-- 
mambaSoft di Silvan Calarco - http://www.mambasoft.it

Received on Friday, 8 September 2006 00:43:13 UTC