RE: Think before you write Semantic Web crawlers

Amen! I could not have said it better.

Esteban Sota

-----Original Message-----
From: semantic-web-request@w3.org [mailto:semantic-web-request@w3.org] On
behalf of Martin Hepp
Sent: Tuesday, 21 June 2011 9:50
To: semantic-web@w3.org; public-lod@w3.org
Subject: Think before you write Semantic Web crawlers

Hi all:

For the third time in a few weeks, we have received massive complaints from
site owners that Semantic Web crawlers from universities hit their sites in a
manner close to a denial-of-service attack, i.e., crawling data at maximum
bandwidth with many parallel requests.

It's clear that a single, carelessly written crawler script, run from a
powerful university network, can quickly create a terrible traffic load.

Many of the scripts we saw:

- ignored robots.txt,
- ignored clear crawling speed limitations in robots.txt,
- did not identify themselves properly in the HTTP request header or lacked
contact information therein, 
- used no mechanism at all to limit the default crawling speed or to enforce
re-crawling delays (see the sketch after this list).
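
For illustration, here is a minimal sketch of a polite fetch loop, using only
the Python standard library. The crawler name, contact address, example URLs,
and 5-second fallback delay below are placeholders, not a recommendation for
your project: it reads robots.txt first, skips disallowed URLs, honours any
Crawl-delay, identifies itself and its operator in the request headers, and
fetches sequentially with a pause between requests.

    import time
    import urllib.request
    import urllib.robotparser

    # Placeholder identity: use your project's real name, URL, and contact address.
    USER_AGENT = ("ExampleSemWebCrawler/0.1 "
                  "(+http://example.org/crawler; mailto:crawler-admin@example.org)")
    FALLBACK_DELAY = 5.0  # seconds between requests if robots.txt sets no Crawl-delay

    def fetch_politely(base, paths):
        # Read robots.txt once before requesting anything else from the host.
        rp = urllib.robotparser.RobotFileParser(base + "/robots.txt")
        rp.read()

        # Honour the site's Crawl-delay if it declares one; otherwise stay conservative.
        delay = rp.crawl_delay(USER_AGENT) or FALLBACK_DELAY

        for path in paths:
            url = base + path
            if not rp.can_fetch(USER_AGENT, url):
                continue  # disallowed by robots.txt: skip it, do not "try anyway"

            # Identify the crawler and its operator in the HTTP request headers.
            req = urllib.request.Request(url, headers={
                "User-Agent": USER_AGENT,
                "From": "crawler-admin@example.org",  # placeholder contact address
            })
            with urllib.request.urlopen(req, timeout=30) as resp:
                data = resp.read()
                # ... hand `data` to whatever RDF parser you use ...

            # One request at a time, with a pause: no parallel hammering of one host.
            time.sleep(delay)

Called, for example, as fetch_politely("http://example.org", ["/a.rdf", "/b.rdf"]).
Even this simple pattern avoids every problem in the list above.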

This irresponsible behavior can be the final reason for site-owners to say
farewell to academic/W3C-sponsored semantic technology.

So please, please advise all of your colleagues and students NOT to write
naive crawler scripts for the Billion Triples Challenge or any other purpose
without first familiarizing themselves with the state of the art in "friendly
crawling".

Best wishes

Martin Hepp

Received on Tuesday, 21 June 2011 22:45:05 UTC