
Re: Please stop massive crawling against http://openean.kaufkauf.net/id/

From: Martin Hepp (UniBW) <martin.hepp@ebusiness-unibw.org>
Date: Fri, 11 Jun 2010 17:09:46 +0200
Message-ID: <4C1251BA.2020409@ebusiness-unibw.org>
To: Andreas Harth <andreas@harth.org>
CC: semantic-web at W3C <semantic-web@w3c.org>

Hi Andreas,

On 09.06.10 10:18, Andreas Harth wrote:
> Hi Martin,
>
> first of all, congrats for publishing an apparently popular dataset!
>
>    
Thanks, but we only helped initially with lifting the data. It is 
hosted on a private machine.
> On Tue, Jun 08, 2010 at 10:04:14AM +0200, Martin Hepp (UniBW) wrote:
>    
>> The crawling has been so intense that he had to temporarily block all
>> traffic to this dataset.
>>      
> Was this before or after you've fixed the redirect issue?
>    
After we fixed the issue.
> In general I agree with you that the crawlers should be bug-free
> and well-behaved.  Unfortunately that's not always the case.
>    
>> 3. implement some bandwidth throttling technique that limits the
>> bandwidth consumption on a single host to a moderate amount.
>>      
> If you want to make sure that only a certain number of requests get
> serviced you could configure throttling on your server.  See e.g. [1].
>    
The main point is that the *relatively small* semantic web community 
should be very "site-friendly" in general and, in particular, should 
limit the crawling load it puts on sites.

Of course, there are many techniques for protecting a site against 
ill-behaved crawlers. However, many of those techniques require skills 
and expertise that the average site owner doesn't have.

It would be very bad if "Joe, the site owner" adds RDFa or RDF/XML to 
his site and the first effect of joining the semantic web effort is 
that academic crawlers bring down his server through massive crawling.
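
To illustrate the crawler-side throttling suggested in point 3 above, 
here is a minimal sketch of a per-host politeness delay. It is not 
taken from any existing crawler: the class name HostThrottle, the 
wait() method, and the 2-second default are assumptions chosen for the 
example.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Hypothetical helper: enforce a minimum delay between requests to one host."""

    def __init__(self, min_delay=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay      # assumed default, in seconds
        self._clock = clock             # injectable so the logic can be tested
        self._sleep = sleep
        self._next_allowed = {}         # host -> earliest time of the next request

    def wait(self, url):
        """Block until the host of `url` may be contacted again.

        Returns the number of seconds waited.
        """
        host = urlsplit(url).hostname or ""
        now = self._clock()
        ready_at = self._next_allowed.get(host, now)
        delay = max(0.0, ready_at - now)
        if delay > 0:
            self._sleep(delay)
        # Schedule the next permitted request for this host.
        self._next_allowed[host] = max(now, ready_at) + self.min_delay
        return delay
```

A crawler would call wait(url) immediately before each HTTP request. 
Requests to different hosts do not delay each other, so overall 
throughput stays high while the load on any single site stays moderate.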

Best

Martin
> Best regards,
> Andreas.
>
> [1] http://code.google.com/p/ldspider/wiki/ServerConfig
>
>    

-- 
--------------------------------------------------------------
martin hepp
e-business & web science research group
universitaet der bundeswehr muenchen

e-mail:  hepp@ebusiness-unibw.org
phone:   +49-(0)89-6004-4217
fax:     +49-(0)89-6004-4620
www:     http://www.unibw.de/ebusiness/ (group)
          http://www.heppnetz.de/ (personal)
skype:   mfhepp
twitter: mfhepp

Received on Friday, 11 June 2010 15:33:50 GMT
