Re: Think before you write Semantic Web crawlers

On 21 June 2011, at 03:49, Martin Hepp wrote:
> Many of the scripts we saw
> - ignored robots.txt,
> - ignored clear crawling speed limitations in robots.txt,
> - did not identify themselves properly in the HTTP request header or lacked contact information therein, 
> - used no mechanisms at all for limiting the default crawling speed and re-crawling delays.
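For context: the "clear crawling speed limitations in robots.txt" mentioned
above are usually expressed with the non-standard Crawl-delay directive,
which only some crawlers honour. A minimal robots.txt sketch, with the
10-second delay picked purely as an illustration:

User-agent: *
Crawl-delay: 10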


Do you have a list of those crawlers and a way to identify them,
so we can add them to our blocking lists?

.htaccess or Apache config with rules such as:

# added for abusive downloads or not respecting robots.txt
# (SetEnvIfNoCase does an unanchored, case-insensitive regex match,
# so a plain substring of the User-Agent is enough)
SetEnvIfNoCase User-Agent "Technorati" bad_bot
SetEnvIfNoCase User-Agent "WikioFeedBot" bad_bot
# [… cut part of my list …]
# allow everyone, then deny the flagged address and user agents
Order Allow,Deny
Deny from 85.88.12.104
Deny from env=bad_bot
Allow from all
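
If you prefer mod_rewrite over the Order/Deny directives, here is a roughly
equivalent, untested sketch (assuming mod_rewrite is loaded), reusing the
same user-agent substrings and address as above:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Technorati [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WikioFeedBot [NC,OR]
RewriteCond %{REMOTE_ADDR} ^85\.88\.12\.104$
RewriteRule .* - [F,L]

The [F] flag makes Apache answer 403 Forbidden for any request matching one
of the conditions.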



-- 
Karl Dubost - http://dev.opera.com/
Developer Relations & Tools, Opera Software

Received on Wednesday, 22 June 2011 14:35:29 UTC