Re: Please stop massive crawling against http://openean.kaufkauf.net/id/

Hi all, 

> The volunteer who is hosting http://openean.kaufkauf.net/id/, a huge set of GoodRelations product model data, is experiencing a problematic amount of traffic from unidentified crawlers located in Ireland (DERI?), the Netherlands (VUA?), and the USA.
> 


Another crawler used at DERI is LDSpider [1], which we use to crawl data for the SWSE search engine and, recently, for the BTC 2010 dataset. 
Along these lines, we admittedly have been doing an unusually large amount of crawling in the past month or two.

> The crawling has been so intense that he had to temporarily block all traffic to this dataset.
> 
> In case you are operating any kind of Semantic Web crawlers that tried to access this dataset, please
> 
> 1. check your crawler for bugs that create excessive traffic (e.g. by redundant requests),

> 2. identify your crawler agent properly in the HTTP header, indicating a contact person, and

User-agent of the LDSpider:
  * ldspider (http://code.google.com/p/ldspider/wiki/Robots)

> 3. implement some bandwidth throttling technique that limits the bandwidth consumption on a single host to a moderate amount.


LDSpider uses a delay policy similar to the one proposed in the IRLBot system. 
We use the following delay times per pay-level domain (PLD; in the case of http://openean.kaufkauf.net/id the PLD is kaufkauf.net):
 * 500 ms for lookups which return content (200 response code)
 * 250 ms for lookups which return no content (e.g. 30x, 40x, 50x).
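For illustration, the per-PLD policy above could be sketched roughly as follows. This is a simplified sketch, not LDSpider's actual implementation; in particular, the two-label PLD heuristic is a naive stand-in for a proper Public Suffix List lookup:

```java
// Simplified sketch of a per-PLD crawl delay policy (not LDSpider's actual code).
public class PldDelayPolicy {

    static final long CONTENT_DELAY_MS = 500;     // 200 responses returning content
    static final long NO_CONTENT_DELAY_MS = 250;  // e.g. 30x, 40x, 50x responses

    // Naive pay-level-domain heuristic: keep the last two host labels.
    // A real crawler would consult the Public Suffix List instead.
    static String pld(String host) {
        String[] labels = host.split("\\.");
        int n = labels.length;
        return (n <= 2) ? host : labels[n - 2] + "." + labels[n - 1];
    }

    // Minimum wait before the next lookup against the same PLD,
    // depending on the status code of the previous response.
    static long delayFor(int statusCode) {
        return (statusCode == 200) ? CONTENT_DELAY_MS : NO_CONTENT_DELAY_MS;
    }

    public static void main(String[] args) {
        System.out.println(pld("openean.kaufkauf.net")); // kaufkauf.net
        System.out.println(delayFor(200));               // 500
        System.out.println(delayFor(404));               // 250
    }
}
```

With this scheme, all hosts under kaufkauf.net share one delay budget, so many distinct hosts on the same PLD cannot multiply the request rate against a single server.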

There are also solutions for server-side bandwidth throttling (see [2] for an example).

Please also see Andreas Harth's reply on the semantic-web mailing list [3].

Best
   Juergen

[1] http://code.google.com/p/ldspider/
[2] http://code.google.com/p/ldspider/wiki/ServerConfig
[3] http://lists.w3.org/Archives/Public/semantic-web/2010Jun/0048.html

Received on Wednesday, 9 June 2010 11:40:11 UTC