- From: Karl Dubost <karld@opera.com>
- Date: Wed, 22 Jun 2011 10:34:46 -0400
- To: Martin Hepp <martin.hepp@ebusiness-unibw.org>
- Cc: semantic-web@w3.org, public-lod@w3.org
On 21 June 2011, at 03:49, Martin Hepp wrote:

> Many of the scripts we saw
> - ignored robots.txt,
> - ignored clear crawling speed limitations in robots.txt,
> - did not identify themselves properly in the HTTP request header or lacked contact information therein,
> - used no mechanisms at all for limiting the default crawling speed and re-crawling delays.

Do you have a list of those and of how to identify them, so that we can put them in our blocking lists?

In .htaccess or the Apache config, use rules such as:

    # added for abusive downloads or not respecting robots.txt
    SetEnvIfNoCase User-Agent ".*Technorati.*" bad_bot
    SetEnvIfNoCase User-Agent ".*WikioFeedBot.*" bad_bot
    # [… cut part of my list …]
    Order Allow,Deny
    Deny from 85.88.12.104
    Deny from env=bad_bot
    Allow from all

--
Karl Dubost - http://dev.opera.com/
Developer Relations & Tools, Opera Software
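On servers running Apache 2.4, where Order/Allow/Deny has been superseded by the Require directives, a rough equivalent of the rules above might look like the sketch below (assuming mod_setenvif and the standard authorization modules are loaded; adapt the patterns and addresses to your own block list):

    # flag abusive user agents, same as in the 2.2-style rules above
    SetEnvIfNoCase User-Agent ".*Technorati.*" bad_bot
    SetEnvIfNoCase User-Agent ".*WikioFeedBot.*" bad_bot

    # allow everyone except the flagged bots and the listed address
    <RequireAll>
        Require all granted
        Require not ip 85.88.12.104
        Require not env bad_bot
    </RequireAll>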
Received on Wednesday, 22 June 2011 14:35:29 UTC