
Re: Think before you publish large datasets (was: Re: Think before you write Semantic Web crawlers)

From: Martin Hepp <martin.hepp@ebusiness-unibw.org>
Date: Tue, 21 Jun 2011 20:03:48 +0200
Cc: public-lod@w3.org, Giovanni Tummarello <giovanni.tummarello@deri.org>
Message-Id: <1F22F119-A1EF-4A47-9CEA-A439DE4A9F34@ebusiness-unibw.org>
To: Andreas Harth <andreas@harth.org>
Hi Andreas:

I do not publish large datasets, and the complaint was not about someone using them. The complaint was about stupid crawlers bombarding sites with unthrottled requests, coming close to a Denial-of-Service attack.

You may want to ask the Sindice guys about implementing polite yet powerful crawlers.
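For illustration, a minimal sketch of what "polite" means in practice: consult robots.txt before fetching and honour any Crawl-delay. The robots.txt content, the `ExampleBot` agent name, and the `polite_fetch_plan` helper below are all hypothetical, not taken from Sindice or any other crawler mentioned here.

```python
import urllib.robotparser

# Hypothetical robots.txt content; a real crawler would first fetch
# http://example.org/robots.txt from the target site.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

def polite_fetch_plan(urls, agent="ExampleBot"):
    """Drop disallowed URLs and report the inter-request delay to honour."""
    delay = rp.crawl_delay(agent) or 1.0  # fall back to 1 s between requests
    allowed = [u for u in urls if rp.can_fetch(agent, u)]
    return allowed, delay

allowed, delay = polite_fetch_plan([
    "http://example.org/data.rdf",
    "http://example.org/private/dump.nt",
])
# allowed == ["http://example.org/data.rdf"]; delay == 2
```

A real crawler would additionally sleep for `delay` seconds between requests to the same host.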

And yes, your institution was among the origins of the malicious crawlers.

I understand more and more why Google, Bing, and Yahoo did not consult the LOD research community when launching schema.org.



On Jun 21, 2011, at 2:29 PM, Andreas Harth wrote:

> Dear Martin,
>
> I agree with you that software accessing large portions of the web
> should adhere to basic principles (such as robots.txt).
>
> However, I wonder why you publish large datasets and then complain when
> people actually use the data.
>
> If you provide a site with millions of triples, your infrastructure should
> scale beyond "I have clicked on a few links and the server seems to be
> doing something".  You should set the HTTP Expires header to leverage the
> widely deployed HTTP caches.  You should have stable URIs.  Also, you should
> configure your servers to shield them from both mad crawlers and DoS
> attacks (see e.g., [1]).
>
> Publishing millions of triples is slightly more complex than publishing your
> personal homepage.
>
> Best regards,
> Andreas.
>
> [1] http://code.google.com/p/ldspider/wiki/ServerConfig
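As a sketch of the Expires advice in the quoted message: a server can emit an HTTP/1.0 Expires header plus an HTTP/1.1 Cache-Control header so intermediary caches absorb repeat requests. The `caching_headers` helper and its default lifetime are hypothetical, not from ldspider's ServerConfig page.

```python
import time
from email.utils import formatdate

def caching_headers(max_age=86400):
    """Build Expires (HTTP/1.0) and Cache-Control (HTTP/1.1) headers
    that let caches serve the response for max_age seconds
    (one day by default)."""
    return {
        # RFC-compliant HTTP date in GMT, max_age seconds in the future
        "Expires": formatdate(time.time() + max_age, usegmt=True),
        "Cache-Control": f"public, max-age={max_age}",
    }

headers = caching_headers(3600)
# e.g. headers["Cache-Control"] == "public, max-age=3600"
```

These headers would be attached to responses for the dataset's URIs, e.g. via the web framework or server configuration.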
Received on Tuesday, 21 June 2011 18:04:18 UTC
