
Re: Please stop massive crawling against http://openean.kaufkauf.net/id/

From: Martin Hepp (UniBW) <martin.hepp@ebusiness-unibw.org>
Date: Fri, 11 Jun 2010 17:09:46 +0200
Message-ID: <4C1251BA.2020409@ebusiness-unibw.org>
To: Andreas Harth <andreas@harth.org>
CC: semantic-web at W3C <semantic-web@w3c.org>

Hi Andreas,

On 09.06.10 10:18, Andreas Harth wrote:
> Hi Martin,
> first of all, congrats for publishing an apparently popular dataset!
Thanks, but we only helped initially with lifting the data. The dataset
itself is hosted on a private machine.
> On Tue, Jun 08, 2010 at 10:04:14AM +0200, Martin Hepp (UniBW) wrote:
>> The crawling has been so intense that he had to temporarily block all
>> traffic to this dataset.
> Was this before or after you've fixed the redirect issue?
After we fixed the issue.
> In general I agree with you that the crawlers should be bug-free
> and well-behaved.  Unfortunately that's not always the case.
>> 3. implement some bandwidth throttling technique that limits the
>> bandwidth consumption on a single host to a moderate amount.
> If you want to make sure that only a certain number of requests get
> serviced you could configure throttling on your server.  See e.g. [1].
The main point is that the *relatively small* Semantic Web community
should be very "site-friendly" in general, and in particular should
limit the crawling load it places on any single site.
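
To make point 3 above concrete: a crawler can remember, per host, when it
is next allowed to send a request, and sleep until then. The following is
only a minimal sketch of that idea. The class name PerHostThrottle, the
5-second default delay, and the injectable clock/sleep parameters are
assumptions made for illustration; none of this is taken from ldspider or
any other existing crawler.

```python
import time
from urllib.parse import urlparse


class PerHostThrottle:
    """Enforce a minimum delay between requests to the same host.

    Hypothetical helper for illustration; min_delay is an assumed default.
    """

    def __init__(self, min_delay=5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock          # injectable so the logic can be tested
        self._sleep = sleep
        self._next_allowed = {}      # host -> earliest time of next request

    def wait(self, url):
        """Block until the host of `url` may be contacted again.

        Returns the number of seconds that were waited.
        """
        host = urlparse(url).netloc.lower()
        now = self._clock()
        ready_at = self._next_allowed.get(host, now)
        delay = max(0.0, ready_at - now)
        if delay > 0:
            self._sleep(delay)
        # Schedule the next permitted request for this host.
        self._next_allowed[host] = max(now, ready_at) + self.min_delay
        return delay
```

A crawler loop would call wait(url) immediately before each HTTP request.
Requests to different hosts do not delay each other, so overall throughput
stays high while no single site sees more than one request per min_delay
seconds.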

Of course, there are many techniques for protecting a site against
ill-behaved crawlers. However, many of those techniques require skills
and expertise that the average site owner does not have.

It would be very bad if "Joe, the siteowner" adds RDFa / RDF/XML to his 
site and the first effect of joining the semantic web effort is that 
academic crawlers kill the server by massive crawling.


> Best regards,
> Andreas.
> [1] http://code.google.com/p/ldspider/wiki/ServerConfig

martin hepp
e-business & web science research group
universitaet der bundeswehr muenchen

e-mail:  hepp@ebusiness-unibw.org
phone:   +49-(0)89-6004-4217
fax:     +49-(0)89-6004-4620
www:     http://www.unibw.de/ebusiness/ (group)
          http://www.heppnetz.de/ (personal)
skype:   mfhepp
twitter: mfhepp

Check out GoodRelations for E-Commerce on the Web of Linked Data!

Resources for developers:

Overview - http://www.heppnetz.de/projects/goodrelations/webcast/
How-to   - http://vimeo.com/7583816

Talk at the Semantic Technology Conference 2009:
"Semantic Web-based E-Commerce: The GoodRelations Ontology"

Tutorial materials:
ISWC 2009 Tutorial: The Web of Data for E-Commerce in Brief: A Hands-on Introduction to the GoodRelations Ontology, RDFa, and Yahoo! SearchMonkey
Received on Friday, 11 June 2010 15:33:50 UTC