Re: Think before you publish large datasets (was: Re: Think before you write Semantic Web crawlers) from Martin Hepp on 2011-06-21 (public-lod@w3.org from June 2011)

From: Martin Hepp <martin.hepp@ebusiness-unibw.org>
Date: Tue, 21 Jun 2011 20:03:48 +0200
To: Andreas Harth <andreas@harth.org>
Cc: public-lod@w3.org, Giovanni Tummarello <giovanni.tummarello@deri.org>
Message-Id: <1F22F119-A1EF-4A47-9CEA-A439DE4A9F34@ebusiness-unibw.org>

Hi Andreas:

I do not publish large datasets, and the complaint was not about someone using them. The complaint was about stupid crawlers bombarding sites with unlimited crawling throughput, close to a denial-of-service attack.

You may want to ask the Sindice guys about implementing polite yet powerful crawlers.

And yes, your institution was among the origins of the malicious crawlers.

I increasingly understand why Google, Bing, and Yahoo did not consult with the LOD research community when launching schema.org.

Best
Martin

--

On Jun 21, 2011, at 2:29 PM, Andreas Harth wrote:

> Dear Martin,
> 
> I agree with you in that software accessing large portions of the web
> should adhere to basic principles (such as robots.txt).
> 
> However, I wonder why you publish large datasets and then complain when
> people actually use the data.
> 
> If you provide a site with millions of triples your infrastructure should
> scale beyond "I have clicked on a few links and the server seems to be
> doing something".  You should set the HTTP Expires header to leverage the widely
> deployed HTTP caches.  You should have stable URIs.  Also, you should
> configure your servers to shield them from both mad crawlers and DoS
> attacks (see e.g., [1]).
> 
> Publishing millions of triples is slightly more complex than publishing your
> personal homepage.
> 
> Best regards,
> Andreas.
> 
> [1] http://code.google.com/p/ldspider/wiki/ServerConfig
>

Received on Tuesday, 21 June 2011 18:04:18 UTC