W3C home > Mailing lists > Public > public-lod@w3.org > June 2011

Re: Think before you write Semantic Web crawlers

From: Kingsley Idehen <kidehen@openlinksw.com>
Date: Wed, 22 Jun 2011 15:41:47 +0100
Message-ID: <4E01FF2B.4040204@openlinksw.com>
To: public-lod@w3.org
On 6/22/11 3:34 PM, Karl Dubost wrote:
> On 21 Jun 2011, at 03:49, Martin Hepp wrote:
>> Many of the scripts we saw
>> - ignored robots.txt,
>> - ignored clear crawling speed limitations in robots.txt,
>> - did not identify themselves properly in the HTTP request header or lacked contact information therein,
>> - used no mechanisms at all for limiting the default crawling speed and re-crawling delays.
> Do you have a list of those and how to identify them?
> So we can put them in our blocking lists?
> .htaccess or Apache config with rules such as:
> # added for abusive downloads or not respecting robots.txt
> SetEnvIfNoCase User-Agent "Technorati" bad_bot
> SetEnvIfNoCase User-Agent "WikioFeedBot" bad_bot
> # [… cut part of my list …]
> Order Allow,Deny
> Deny from env=bad_bot
> Allow from all
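[For context: the misbehaviours Martin lists are just as easy to avoid on the crawler side. A minimal sketch in Python using only the standard library, checking robots.txt rules and the Crawl-delay directive before fetching; the robots.txt content, URLs, and agent name are illustrative:]

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content; in practice this would be fetched
# from http://<host>/robots.txt before crawling that host.
robots_txt = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A crawler that identifies itself and carries contact information.
agent = "ExampleLODBot/1.0 (+mailto:ops@example.org)"

def allowed(url):
    # Respect Disallow rules instead of ignoring robots.txt.
    return rp.can_fetch(agent, url)

def delay():
    # Honour the declared crawl-delay; fall back to a conservative default.
    return rp.crawl_delay(agent) or 10
```

A well-behaved crawler would call `allowed()` before every request and sleep for `delay()` seconds between requests to the same host.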

But that doesn't solve the bigger problem. Ultimately, the only way this 
will scale is an Apache module for WebID that enables QoS algorithms or 
heuristics based on trust logics. Apache can get with the program via 
modules: Henry, Joe, and a few others are working on keeping Apache in 
step with the new Data Space dimension of the Web :-)



Kingsley Idehen
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen
Received on Wednesday, 22 June 2011 14:42:13 UTC

This archive was generated by hypermail 2.4.0 : Thursday, 24 March 2022 20:29:54 UTC