Re: Think before you write Semantic Web crawlers

Blacklists and services of this kind already exist, e.g.

http://www.bot-trap.de/home/

It is pretty easy to set up honeypots (e.g. a directory "/bottrap"): link to it from your main page, but disallow crawling it via robots.txt.

You can then quickly collect and share the IPs, IP ranges, or agent tokens of clients that access /bottrap content anyway.
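
To make the honeypot idea concrete, here is a minimal sketch in Python, not a finished tool. It assumes the trap lives under "/bottrap/", that the web server writes the common Apache/NGINX "combined" access log format, and that the robots.txt shown is the one you serve; the names ROBOTS_TXT, polite_crawler_may_fetch and collect_offenders are made up for this illustration.

```python
import re
from collections import defaultdict
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every well-behaved crawler is told to stay out of the trap.
ROBOTS_TXT = """\
User-agent: *
Disallow: /bottrap/
"""

# Assumption: "combined" access log format, i.e.
# ip ident user [time] "METHOD path PROTO" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def polite_crawler_may_fetch(path, agent="ExampleBot"):
    """Return True if a crawler that honours ROBOTS_TXT may request `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


def collect_offenders(log_lines, trap_prefix="/bottrap/"):
    """Map each client IP that requested a trap URL to the agent strings it sent."""
    offenders = defaultdict(set)
    for line in log_lines:
        match = LOG_LINE.match(line)
        # Lines that do not fit the assumed format are skipped silently.
        if match and match.group("path").startswith(trap_prefix):
            offenders[match.group("ip")].add(match.group("agent"))
    return dict(offenders)
```

Feeding the lines of an access log to collect_offenders() gives a dictionary of IPs and the agent strings they used, which could then be reviewed by hand before sharing. Note that a hit on the trap only shows that robots.txt was ignored (or that a human followed the link), so you would probably want to hide the link from human visitors.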

  

On Jun 23, 2011, at 8:27 AM, Antoine Zimmermann wrote:

> On 22/06/2011 23:49, Richard Cyganiak wrote:
>> On 21 Jun 2011, at 10:44, Martin Hepp wrote:
>>> PS: I will not release the IP ranges from which the trouble
>>> originated, but rest assured, there were top research institutions
>>> among them.
>> 
>> The right answer is: name and shame. That is the way to teach them.
>> 
>> Like Karl said, we should collect information about abusive crawlers
>> so that site operators can defend themselves. It won't be *that* hard
>> to research and collect the IP ranges of offending universities.
>> 
>> I started a list here: http://www.w3.org/wiki/Bad_Crawlers
> 
> What's the use of this list?
> Assume it stays empty, as you hope. What's the use?
> Assume it gets filled with names: so what? It does not prove these
> crawlers are bad. The authors of the crawlers can just remove themselves
> from the list. If a crawler is on the list, chances are that nobody
> would notice anyway, especially not the kind of people that Martin is
> defending in his email. If a crawler is put on the list because it is
> bad and measures are taken, what happens when the crawler gets fixed and
> becomes polite? And what if measures are taken when the crawler was not
> bad at all to start with?
> Surely, this list is utterly useless.
> 
> Maybe you can keep the page to describe the problems that bad
> crawlers create and the measures that publishers can take to
> overcome problematic situations.
> 
> 
> AZ
> 
> 
>> 
>> The list is currently empty. I hope it stays that way.
>> 
>> Thank you all, Richard
> 

Received on Thursday, 23 June 2011 08:37:28 UTC