Re: Think before you write Semantic Web crawlers

From: Karl Dubost <karld@opera.com>
Date: Wed, 22 Jun 2011 10:34:46 -0400
Message-Id: <99CDAF15-7237-492C-81AC-6D53141F1B06@opera.com>
Cc: semantic-web@w3.org, public-lod@w3.org
To: Martin Hepp <martin.hepp@ebusiness-unibw.org>

On 21 June 2011 at 03:49, Martin Hepp wrote:
> Many of the scripts we saw
> - ignored robots.txt,
> - ignored clear crawling speed limitations in robots.txt,
> - did not identify themselves properly in the HTTP request header or lacked contact information therein, 
> - used no mechanisms at all for limiting the default crawling speed and re-crawling delays.

Do you have a list of those scripts and a way to identify them,
so that we can add them to our blocking lists?

For example, in .htaccess or the Apache config, with rules such as:

# added for abusive downloads or not respecting robots.txt
SetEnvIfNoCase User-Agent ".*Technorati.*" bad_bot
SetEnvIfNoCase User-Agent ".*WikioFeedBot.*" bad_bot
# [… cut part of my list …]
Order Allow,Deny
Deny from
Deny from env=bad_bot
Allow from all
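
A minimal sketch, in Python, of how a list like this could be checked against sample User-Agent strings before the rules go live. It assumes the same case-insensitive, unanchored matching that SetEnvIfNoCase performs. The function name is_bad_bot and the sample User-Agent strings are made up for illustration; only the two patterns come from the rules above.

```python
import re

# Patterns taken from the SetEnvIfNoCase rules above. The match is
# unanchored, so a bare substring is enough to catch the bot name.
BAD_BOT_PATTERNS = ["Technorati", "WikioFeedBot"]

# One combined, case-insensitive regex (mirrors SetEnvIfNoCase).
_BAD_BOT_RE = re.compile("|".join(BAD_BOT_PATTERNS), re.IGNORECASE)


def is_bad_bot(user_agent: str) -> bool:
    """Return True if the User-Agent matches any blocked pattern."""
    return _BAD_BOT_RE.search(user_agent) is not None


if __name__ == "__main__":
    # Hypothetical User-Agent strings, for illustration only.
    samples = [
        "technorati-crawler/1.0",
        "Mozilla/5.0 (compatible; ExampleBrowser)",
    ]
    for ua in samples:
        print(ua, "->", "blocked" if is_bad_bot(ua) else "allowed")
```

This only approximates what the server does with the bad_bot variable; the actual blocking still happens in the Deny rule.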

Karl Dubost - http://dev.opera.com/
Developer Relations & Tools, Opera Software
Received on Wednesday, 22 June 2011 14:35:29 UTC
