Re: Think before you write Semantic Web crawlers from Karl Dubost on 2011-06-22 (semantic-web@w3.org from June 2011)

From: Karl Dubost <karld@opera.com>
Date: Wed, 22 Jun 2011 10:34:46 -0400
To: Martin Hepp <martin.hepp@ebusiness-unibw.org>
Cc: semantic-web@w3.org, public-lod@w3.org
Message-Id: <99CDAF15-7237-492C-81AC-6D53141F1B06@opera.com>

On 21 June 2011 at 03:49, Martin Hepp wrote:
> Many of the scripts we saw
> - ignored robots.txt,
> - ignored clear crawling speed limitations in robots.txt,
> - did not identify themselves properly in the HTTP request header or lacked contact information therein, 
> - used no mechanisms at all for limiting the default crawling speed and re-crawling delays.


Do you have a list of those scripts and a way to identify them,
so that we can add them to our blocking lists?

Blocking can be done in .htaccess or the Apache config with rules such as:

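# added for abusive downloads or not respecting robots.txt
# SetEnvIfNoCase matches the pattern anywhere in the header, ignoring case
SetEnvIfNoCase User-Agent "Technorati" bad_bot
SetEnvIfNoCase User-Agent "WikioFeedBot" bad_bot
# [… cut part of my list …]
# With "Order Allow,Deny", Allow is evaluated first and a matching Deny wins
Order Allow,Deny
Allow from all
Deny from 85.88.12.104
Deny from env=bad_bot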
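
On Apache 2.4 or later, where Order/Allow/Deny is only available through mod_access_compat, roughly the same policy could be written with Require directives. This is a sketch under that assumption and is not part of my current config; it reuses the bad_bot variable and the IP address from the rules above:

```apache
# Sketch for Apache 2.4+ (mod_setenvif, mod_authz_core, mod_authz_host).
# Assumption: same bad_bot variable and address as in the 2.2-style rules.
SetEnvIfNoCase User-Agent "Technorati" bad_bot
SetEnvIfNoCase User-Agent "WikioFeedBot" bad_bot
<RequireAll>
    # Everyone is granted access unless one of the negated rules matches
    Require all granted
    Require not ip 85.88.12.104
    Require not env bad_bot
</RequireAll>
```

The negated Require lines have to sit inside a RequireAll block, since a negation on its own cannot grant access.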



-- 
Karl Dubost - http://dev.opera.com/
Developer Relations & Tools, Opera Software

Received on Wednesday, 22 June 2011 14:35:29 UTC