Re: Think before you write Semantic Web crawlers

From: Antoine Zimmermann <antoine.zimmermann@insa-lyon.fr>
Date: Thu, 23 Jun 2011 11:40:13 +0200
Message-ID: <4E0309FD.5030606@insa-lyon.fr>
To: public-lod@w3.org

My concern is not really about the idea of blacklisting etc. I am
concerned about the means. Certainly a public wiki page is not a good
place to put accusations.

On 23/06/2011 11:01, Richard Cyganiak wrote:
> Antoine,
> On 23 Jun 2011, at 07:27, Antoine Zimmermann wrote:
>>> I started a list here: http://www.w3.org/wiki/Bad_Crawlers
>> What's the use of this list? Assume it stays empty, as you hope.
>> What's the use?
> That should be obvious.

Not to me. Can you elaborate?

>> Assume it gets filled with names: so what? It does not prove these
>> crawlers are bad. The authors of the crawlers can just remove
>> themselves from the list.
> Check out the "watch" and "history" tabs on that page.


So, on Thursday 23rd at 9:04, user foobar96 wrote that Sindice is a bad
crawler. Then what?

>> If a crawler is on the list, chances are that nobody would notice
>> anyway, especially not the kind of people that Martin is defending
>> in his email.
> It takes very little effort to make a copy-paste Apache config
> snippet that blocks the offending IP ranges. Pointing the victims of
> abusive crawlers to such a snippet is a first-aid measure.

How do you know who the victims are? They somehow have to make
themselves known so that they can be directed to the wiki page. If you 
know the victims, you'd better give them the config snippet directly. A 
wiki page which is /accusing/ people is much more likely to be 
inaccurate (or empty) than a wiki page with encyclopedic details on 
common knowledge.
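
For reference, the kind of copy-paste snippet Richard mentions could be
generated along these lines. This is only a minimal sketch under my own
assumptions: it targets the Apache 2.4 "Require not ip" syntax, the
function name apache_block_snippet is made up for illustration, and the
ranges are documentation placeholders (RFC 5737), not real crawlers.

```python
import ipaddress


def apache_block_snippet(cidr_ranges):
    """Return an Apache 2.4 access-control fragment denying the given ranges.

    Hypothetical helper, for illustration only.
    """
    lines = ["<RequireAll>", "    Require all granted"]
    for cidr in cidr_ranges:
        # ip_network validates the input and normalizes it to network/prefix form
        network = ipaddress.ip_network(cidr, strict=False)
        lines.append(f"    Require not ip {network}")
    lines.append("</RequireAll>")
    return "\n".join(lines)


# Placeholder ranges (assumption): replace with the ranges actually observed
print(apache_block_snippet(["192.0.2.0/24", "198.51.100.17"]))
```

The output would go inside a <Directory> or <Location> section of the
server configuration. Whoever hands it out still has to know which
ranges are really offending, which is the part a wiki page cannot
guarantee.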

>> If a crawler is put to the list because it is bad and measures are
>> taken, what happens when the crawler gets fixed and becomes polite?
>> And what if measures are taken while the crawler was not bad at all
>> to start with?
> It shifts some pain from the server operators to the crawler
> operators who have to see how they get off the list again. That's a
> good thing.

It's a public wiki. It can hardly be simpler to get off the list.

>> Surely, this list is utterly useless.
> It's important to show that the community is taking the issue seriously
> and is establishing social norms and processes to deal with problems
> as they arise. These processes will start out primitive, but I'd
> claim that a wiki page is one step up in sophistication from this
> mailing list thread.

I hear you, but not like that, not with a public wiki page.


> Best, Richard
>> Maybe you can keep the page to describe the problems that bad
>> crawlers create and the measures that publishers can take to
>> overcome problematic situations.
>> AZ
>>> The list is currently empty. I hope it stays that way.
>>> Thank you all, Richard

Antoine Zimmermann
Researcher at:
Laboratoire d'InfoRmatique en Image et Systèmes d'information
Database Group
7 Avenue Jean Capelle
69621 Villeurbanne Cedex
Tel: +33(0)4 72 43 61 74 - Fax: +33(0)4 72 43 87 13
Lecturer at:
Institut National des Sciences Appliquées de Lyon
20 Avenue Albert Einstein
69621 Villeurbanne Cedex
Received on Thursday, 23 June 2011 09:40:53 UTC
