Re: Think before you publish large datasets (was: Re: Think before you write Semantic Web crawlers)

Concur.  Small companies, too, are sometimes surprised by large EC2 invoices.  If people are *using* your data, that's good.  If poorly behaved bots are simply costing you money because their creators can't be bothered to support the robots exclusion protocol, that's bad.
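
For crawler authors, respecting robots.txt takes only a few lines.  The sketch below uses Python's standard urllib.robotparser; the user-agent string and URLs are made-up placeholders, and the one-second fallback delay is an assumption on my part, not anything the protocol mandates.

    import time
    from urllib import robotparser

    USER_AGENT = "ExampleLDCrawler/0.1"   # hypothetical user-agent, for illustration only

    def fetch_policy(robots_url, target_url):
        """Parse robots.txt and report whether target_url may be fetched, plus a delay."""
        rp = robotparser.RobotFileParser()
        rp.set_url(robots_url)
        rp.read()                                    # download and parse robots.txt
        allowed = rp.can_fetch(USER_AGENT, target_url)
        delay = rp.crawl_delay(USER_AGENT) or 1.0   # assumed 1 s default if no Crawl-delay given
        return allowed, delay

    allowed, delay = fetch_policy("http://example.org/robots.txt",
                                  "http://example.org/id/product/42")
    if allowed:
        time.sleep(delay)    # pause between requests to keep the load on the publisher low
        # ... issue the actual HTTP request here ...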
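
On the publisher side, Andreas's suggestion below about the Expires header is cheap to implement.  Here is a minimal sketch, again standard-library Python only (a toy WSGI app; the one-day max-age and the text/turtle content type are illustrative assumptions, not anything from this thread):

    import time
    from email.utils import formatdate
    from wsgiref.simple_server import make_server

    MAX_AGE = 86400  # cache for one day; tune to how often the dataset actually changes

    def app(environ, start_response):
        headers = [
            ("Content-Type", "text/turtle"),
            ("Cache-Control", "public, max-age=%d" % MAX_AGE),
            # Expires wants an absolute HTTP-date in GMT
            ("Expires", formatdate(time.time() + MAX_AGE, usegmt=True)),
        ]
        start_response("200 OK", headers)
        return [b"# RDF payload would be served here\n"]

    if __name__ == "__main__":
        make_server("", 8000, app).serve_forever()

In production you would more likely set these headers in the web server or a caching proxy in front of it, but the effect is the same: repeat requests from well-behaved clients can be answered from caches instead of hitting your triple store.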

Regards,
Dave




On Jun 21, 2011, at 14:22, Dieter Fensel wrote:

> -1.
> Obviously it is not useful to kill the web servers of small shops for the
> sake of academic experiments.
> 
> At 02:29 PM 6/21/2011, Andreas Harth wrote:
>> Dear Martin,
>> 
>> I agree with you that software accessing large portions of the web
>> should adhere to basic principles (such as robots.txt).
>> 
>> However, I wonder why you publish large datasets and then complain when
>> people actually use the data.
>> 
>> If you provide a site with millions of triples, your infrastructure should
>> scale beyond "I have clicked on a few links and the server seems to be
>> doing something".  You should set the HTTP Expires header to leverage
>> widely deployed HTTP caches.  You should have stable URIs.  Also, you
>> should configure your servers to shield them from both mad crawlers and
>> DoS attacks (see e.g. [1]).
>> 
>> Publishing millions of triples is slightly more complex than publishing your
>> personal homepage.
>> 
>> Best regards,
>> Andreas.
>> 
>> [1] http://code.google.com/p/ldspider/wiki/ServerConfig
> 
> -- 
> Dieter Fensel
> Director STI Innsbruck, University of Innsbruck, Austria
> http://www.sti-innsbruck.at/
> phone: +43-512-507-6488/5, fax: +43-512-507-9872
> 
> 

Received on Tuesday, 21 June 2011 18:48:04 UTC