RE: libwww and avoiding download of binary/unknown files from Adam Mlodzinski on 2006-09-08 (www-lib@w3.org from July to September 2006)

From: Adam Mlodzinski <Adam.Mlodzinski@quest.com>
Date: Thu, 7 Sep 2006 21:21:13 -0400
To: "Silvan Calarco" <silvan.calarco@mambasoft.it>
Cc: <www-lib@w3.org>
Message-ID: <46C5DCD410224A42BE3151AC8AE75E4E0904060D@tormbxw01.prod.quest.corp>
> -----Original Message-----
> From: Silvan Calarco [mailto:silvan.calarco@mambasoft.it] 
> Sent: Thursday, September 07, 2006 8:43 PM
> To: Adam Mlodzinski
> Cc: www-lib@w3.org
> Subject: Re: libwww and avoiding download of binary/unknown files
> 
> At 01:17 on Friday, 8 September 2006, Adam Mlodzinski wrote:
> > How do you accomplish the recursive scanning? Is this a feature of 
> > libwww, or have you written your own code to do this?
> 
> I have defined a link callback function which gets any link 
> from the top page I request. In this function I perform a 
> request for any internal link found, then I wait for the 
> event loop to end and I've got all the pages.

It sounds like the decision whether to download or not should be made here, in your link callback function. When libwww finds a 'link', you are essentially telling it to 'go and get this file'. Is the SRC of an <IMG> tag a link?
Or are you hoping that if you just tell libwww to 'go and get this file', it will respond with 'no, I don't think so - it's a binary file'?


 
> > > all the files are downloaded including binary files.
> >
> > This is always tricky. What is a binary file? Is an image file binary?
> > Probably, if it's a GIF or PNG, but what about an SVG file? Okay, easy
> > enough. But what about a PDF file, or files with no extension at all?
> > Everyone has their own ideas of what makes a binary file binary, and
> > not text/ASCII.
> 
> By default libwww prompts for saving all the files it doesn't 
> recognize or considers binary, that's enough for me now, 

Okay, so you want to decide yourself (with libwww's help) instead of asking the server.


> libwww does it, but maybe I need to know better how it does 
> it... I just want to get html pages and avoid downloading any 
> other file.

Well, it looks like libwww defines file extension mappings to binary file types in HTBInit.c. You might be able to use HTBind_getFormat in your link callback function to tell you information about the file type based on its name.
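
To illustrate the idea of deciding from the name alone, here is a minimal sketch in plain C that does not call libwww at all. The function name url_looks_like_html and the suffix list are my own assumptions for illustration; in your callback you would more likely ask HTBind_getFormat and compare the result against the HTML format. It assumes a POSIX system for strcasecmp.

```c
#include <string.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Hypothetical helper, not part of libwww.
   Returns 1 if the URL looks like an HTML page judging only by its name,
   0 if it does not or if the name gives no clue. */
static int url_looks_like_html(const char *url)
{
    /* Assumed suffix list; adjust to taste. */
    static const char *const suffixes[] = { ".html", ".htm", ".xhtml", ".shtml" };
    char path[1024];
    const char *start = url;
    const char *scheme, *slash, *dot;
    size_t first, len, i;

    /* Skip "scheme://host" if present, so that a dot in the host name
       is not mistaken for a file extension. */
    first = strcspn(url, "/?#");
    scheme = strstr(url, "://");
    if (scheme != NULL && (size_t)(scheme - url) < first) {
        start = scheme + 3;
        start += strcspn(start, "/?#");
    }

    /* Keep the path only, dropping any query string or fragment. */
    len = strcspn(start, "?#");
    if (len >= sizeof path)
        return 0;                 /* too long for this sketch: skip it */
    memcpy(path, start, len);
    path[len] = '\0';

    if (len == 0 || path[len - 1] == '/')
        return 1;                 /* no path or a directory: assume an index page */

    slash = strrchr(path, '/');
    dot = strrchr(slash != NULL ? slash : path, '.');
    if (dot == NULL)
        return 0;                 /* no extension: cannot tell from the name */

    for (i = 0; i < sizeof suffixes / sizeof suffixes[0]; i++)
        if (strcasecmp(dot, suffixes[i]) == 0)
            return 1;
    return 0;
}
```

In the link callback you would only issue the request when this returns 1. Links with no extension are rejected here; those are the cases where asking the server is probably the better option.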

 
> > You have two options: use a HEAD request for each file during the
> > recursive scan (although I don't think all servers support HEAD
> > requests properly) instead of a GET - then decide whether you want the
> > file based on its MIME type (probably set up a filter to do that); OR,
> > decide if you want the file based solely on the file name and/or
> > extension (essentially what MIME does, only instead of asking the
> > server, you decide for yourself).
> >
> > Keep in mind that file extensions don't always give away the file
> > contents - it's a nice convention used 99.9% of the time, but there's
> > nothing preventing anyone from naming a file, ASCII or binary, with
> > any extension they feel like. I know of (vaguely) a Perl script that
> > can tell you if a file is ASCII or binary by reading the first few
> > bytes of the file - but that requires the file to be present, an
> > option you don't have in your case.
> 
> I suppose the HEAD request will read only the beginning of a 
> non html file and return that the header is not recognized or 
> is recognized with a MIME type. 

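A HEAD response carries only the headers, never any part of the body, so the type you get back is whatever the server decides to report. How it decides probably depends on the web server software - most servers use a file-extension to MIME-type mapping, though there might be some that inspect the file as you suggest.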
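
For reference, the 'read the first few bytes' check I mentioned usually comes down to a heuristic like the one sketched below. This is my own illustration, not the Perl script itself and not anything in libwww: the name buffer_looks_binary and the 30% threshold are assumptions, and it only helps once you already have some bytes of the file in hand.

```c
#include <stddef.h>

/* Hypothetical heuristic: returns 1 if the first n bytes look binary,
   0 if they look like text. */
static int buffer_looks_binary(const unsigned char *buf, size_t n)
{
    size_t i, odd = 0;

    for (i = 0; i < n; i++) {
        unsigned char c = buf[i];

        if (c == 0)
            return 1;             /* a NUL byte: treat as binary right away */
        if ((c < 32 && c != '\t' && c != '\n' && c != '\r' && c != '\f')
            || c > 126)
            odd++;                /* control character or high-bit byte */
    }
    /* Assumed threshold: more than 30% odd bytes means binary. */
    return n > 0 && odd * 10 > n * 3;
}
```

Note that text in encodings other than plain ASCII contains high-bit bytes, so a page with many accented characters can trip the threshold; treat the result as a hint only.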

Use a HEAD request if you want the webserver to tell you what type of file a link points to; use libwww's HTBind_getFormat if you want to figure it out yourself from the file name. The latter doesn't require any request at all, so network bandwidth is reduced even further.
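
If you go the HEAD route, the decision comes down to looking at the Content-Type value the server returns. A minimal sketch of that check in plain C follows; content_type_is_html is a name I made up, the list of accepted types is an assumption, and how you obtain the header value from libwww is left out here. It assumes a POSIX system for strncasecmp.

```c
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/* Hypothetical helper, not part of libwww.
   Takes the value of a Content-Type header, e.g.
   "text/html; charset=ISO-8859-1", and returns 1 if it names an HTML type. */
static int content_type_is_html(const char *value)
{
    /* Assumed list of media types to accept as "HTML pages". */
    static const char *const html_types[] = { "text/html", "application/xhtml+xml" };
    size_t len, i;

    if (value == NULL)
        return 0;                 /* no Content-Type at all: do not fetch */

    while (*value == ' ' || *value == '\t')
        value++;                  /* skip leading whitespace */

    /* The media type ends at the first parameter separator or whitespace. */
    len = strcspn(value, "; \t\r\n");

    for (i = 0; i < sizeof html_types / sizeof html_types[0]; i++)
        if (len == strlen(html_types[i]) &&
            strncasecmp(value, html_types[i], len) == 0)
            return 1;
    return 0;
}
```

You would issue the GET for a link only when this returns 1 for the HEAD response.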



> If I can do that it's enough. I'll try to do what you suggest 
> and let you know.
> Thanks.
> 
> Bye,
> Silvan
> 
> --
> mambaSoft di Silvan Calarco - http://www.mambasoft.it
>
Received on Friday, 8 September 2006 01:21:59 UTC