Re: Think before you publish large datasets (was: Re: Think before you write Semantic Web crawlers)

Hi Andreas:

I do not publish large datasets myself, and the complaint was not about people using such data. The complaint was about careless crawlers bombarding sites at unthrottled rates, coming close to a denial-of-service attack.

You may want to ask the Sindice guys about implementing polite yet powerful crawlers.

And yes, your institution was among the origins of the malicious crawlers.

I understand more and more why Google, Bing, and Yahoo did not consult the LOD research community when launching schema.org.

Best
Martin


--


On Jun 21, 2011, at 2:29 PM, Andreas Harth wrote:

> Dear Martin,
> 
> I agree with you that software accessing large portions of the web
> should adhere to basic principles (such as robots.txt).
> 
> However, I wonder why you publish large datasets and then complain when
> people actually use the data.
> 
> If you provide a site with millions of triples, your infrastructure should
> scale beyond "I have clicked on a few links and the server seems to be
> doing something".  You should set the HTTP Expires header to leverage the
> widely deployed HTTP caches.  You should have stable URIs.  Also, you should
> configure your servers to shield them from both mad crawlers and DoS
> attacks (see e.g. [1]).
> 
> Publishing millions of triples is slightly more complex than publishing your
> personal homepage.
> 
> Best regards,
> Andreas.
> 
> [1] http://code.google.com/p/ldspider/wiki/ServerConfig
> 
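(For reference, a minimal sketch of the caching and throttling setup Andreas describes above, assuming an Apache server with mod_expires enabled; the one-week expiry and the crawl delay are illustrative values, not part of the original message:)

    # Apache config: attach an explicit Expires header to RDF responses so
    # downstream HTTP caches can absorb repeat fetches from crawlers
    ExpiresActive On
    ExpiresByType application/rdf+xml "access plus 7 days"
    ExpiresByType text/turtle "access plus 7 days"

    # robots.txt: ask well-behaved crawlers to pace their requests
    User-agent: *
    Crawl-delay: 10
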

Received on Tuesday, 21 June 2011 18:04:18 UTC