Re: Think before you write Semantic Web crawlers

From: Daniel Herzig <herzig@kit.edu>
Date: Tue, 21 Jun 2011 10:24:07 +0200
Cc: <semantic-web@w3.org>, <public-lod@w3.org>
Message-Id: <BF174FDD-5535-47C3-9D2D-793220AF4ACF@kit.edu>
To: Martin Hepp <martin.hepp@ebusiness-unibw.org>

Hi Martin,

Have you tried putting Squid [1] as a reverse proxy in front of your servers and using delay pools [2] to throttle hungry crawlers?

Cheers,
Daniel

[1] http://www.squid-cache.org/
[2] http://wiki.squid-cache.org/Features/DelayPools
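
A minimal squid.conf sketch of that kind of setup (untested, and only a guess at your environment: the host names, ports and rates are placeholders, and Squid has to be built with delay-pool support):

    # Run Squid as a reverse proxy (accelerator) in front of the origin server
    http_port 80 accel defaultsite=www.example.org
    cache_peer 127.0.0.1 parent 8080 0 no-query originserver name=origin
    acl our_site dstdomain www.example.org
    http_access allow our_site

    # One class-1 delay pool: cap the aggregate bandwidth of matching requests
    delay_pools 1
    delay_class 1 1
    # restore rate / bucket size, in bytes per second / bytes (~64 KB/s here)
    delay_parameters 1 64000/64000

    # Throttle clients whose User-Agent looks like a crawler; bots that do not
    # identify themselves will of course not match this ACL
    acl crawlers browser -i (crawler|spider|bot)
    delay_access 1 allow crawlers
    delay_access 1 deny all

Matched clients still get served, just at a bounded rate, so greedy but legitimate crawlers are slowed down instead of taking the site out.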

On 21.06.2011, at 09:49, Martin Hepp wrote:

> Hi all:
> 
> For the third time in a few weeks, we have received massive complaints from site owners because Semantic Web crawlers from universities visited their sites in a way that comes close to a denial-of-service attack, i.e., crawling data at maximum bandwidth over many parallel connections.
> 
> It's clear that a single, carelessly written crawler script, run from a powerful university network, can quickly create a terrible traffic load.
> 
> Many of the scripts we saw:
> 
> - ignored robots.txt entirely,
> - ignored explicit crawl-delay limits set in robots.txt,
> - did not identify themselves properly in the HTTP request headers or lacked contact information there,
> - used no mechanism at all for limiting the default crawling speed or honoring re-crawl delays.
> 
> This irresponsible behavior can be the final straw that makes site owners say farewell to academic/W3C-sponsored semantic technology.
> 
> So please, please: advise all of your colleagues and students NOT to write simple crawler scripts for the Billion Triples Challenge or anything else without first familiarizing themselves with the state of the art in "friendly crawling".
> 
> Best wishes
> 
> Martin Hepp
> 


Received on Tuesday, 21 June 2011 22:45:06 UTC
