
Re: Think before you write Semantic Web crawlers

From: Richard Cyganiak <richard@cyganiak.de>
Date: Thu, 23 Jun 2011 10:01:06 +0100
Cc: public-lod@w3.org
Message-Id: <76038FA6-CBCB-4C3A-A357-33746691CE8E@cyganiak.de>
To: antoine.zimmermann@insa-lyon.fr
Antoine,

On 23 Jun 2011, at 07:27, Antoine Zimmermann wrote:
>> I started a list here: http://www.w3.org/wiki/Bad_Crawlers
> 
> What's the use of this list?
> Assume it stays empty, as you hope. What's the use?

That should be obvious.

> Assume it gets filled with names: so what? It does not prove these
> crawlers are bad. The authors of the crawlers can just remove themselves
> from the list.

Check out the "watch" and "history" tabs on that page.

> If a crawler is on the list, chances are that nobody
> would notice anyway, especially not the kind of people that Martin is
> defending in his email.

It takes very little effort to make a copy-paste Apache config snippet that blocks the offending IP ranges. Pointing the victims of abusive crawlers to such a snippet is a first-aid measure.
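
As a rough illustration of what such a snippet could look like, here is a minimal Python sketch that prints an Apache access-control block for a list of IP ranges. It assumes Apache 2.4 "Require" syntax (a server still on 2.2 would need the older Order/Deny directives instead). The function name apache_block_snippet is made up for this example, and the addresses are reserved documentation ranges, not real crawlers.

```python
import ipaddress


def apache_block_snippet(cidr_ranges):
    """Build an Apache 2.4 <RequireAll> block that denies the given IP ranges.

    Hypothetical helper for illustration; assumes Apache 2.4 syntax.
    """
    lines = ["<RequireAll>", "    Require all granted"]
    for cidr in cidr_ranges:
        # ip_network() validates the input and raises ValueError on garbage;
        # strict=False normalises entries that have host bits set.
        network = ipaddress.ip_network(cidr, strict=False)
        lines.append(f"    Require not ip {network}")
    lines.append("</RequireAll>")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder ranges reserved for documentation (RFC 5737).
    print(apache_block_snippet(["192.0.2.0/24", "198.51.100.17"]))
```

The printed block would go into the relevant <Directory> or <Location> section of the server configuration. Validating each range before emitting it keeps a typo on the wiki page from turning into a broken config.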

> If a crawler is put on the list because it is
> bad and measures are taken, what happens when the crawler gets fixed and
> becomes polite? And what if measures are taken while the crawler was not bad at all to start with?

It shifts some pain from the server operators to the crawler operators, who have to figure out how to get off the list again. That's a good thing.

> Surely, this list is utterly useless.

It's important to show that the community is taking the issue seriously and is establishing social norms and processes to deal with problems as they arise. These processes will start out primitive, but I'd claim that a wiki page is one step up in sophistication from this mailing list thread.

Best,
Richard




> 
> Maybe you can keep the page to describe the problems that bad
> crawlers create and the measures that publishers can take to
> overcome problematic situations.
> 
> 
> AZ
> 
> 
>> 
>> The list is currently empty. I hope it stays that way.
>> 
>> Thank you all, Richard
> 
Received on Thursday, 23 June 2011 09:01:43 UTC
