
Re: Think before you publish large datasets (was: Re: Think before you write Semantic Web crawlers)

From: Dieter Fensel <dieter.fensel@sti2.at>
Date: Tue, 21 Jun 2011 20:22:16 +0200
To: Andreas Harth <harth@kit.edu>, public-lod@w3.org
Message-ID: <E1QZ5c1-000114-HO@aji.keio.w3.org>
Obviously it is not useful to kill the web servers of small shops for the
sake of academic experiments.

At 02:29 PM 6/21/2011, Andreas Harth wrote:
>Dear Martin,
>I agree with you that software accessing large portions of the web
>should adhere to basic principles (such as respecting robots.txt).
>However, I wonder why you publish large datasets and then complain when
>people actually use the data.
>If you provide a site with millions of triples, your infrastructure should
>scale beyond "I have clicked on a few links and the server seems to be
>doing something".  You should set the HTTP Expires header to leverage the
>widely deployed HTTP caches.  You should have stable URIs.  Also, you should
>configure your servers to shield them from both misbehaving crawlers and
>DoS attacks (see e.g. [1]).
>Publishing millions of triples is slightly more complex than publishing your
>personal homepage.
>Best regards,
>[1] http://code.google.com/p/ldspider/wiki/ServerConfig
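[Editorial note: the Expires header Andreas mentions carries an RFC 1123
HTTP-date some cache lifetime in the future, which lets intermediary HTTP
caches serve the triples without re-hitting the origin server. A minimal
sketch in Python; the one-hour lifetime is an arbitrary illustrative choice,
not a recommendation from the thread:]

```python
import time
from email.utils import formatdate

def expires_header(lifetime_seconds=3600):
    # Build an RFC 1123 HTTP-date one cache lifetime in the future,
    # suitable as the value of an HTTP "Expires" response header.
    # usegmt=True yields the required "GMT" zone designator.
    return formatdate(time.time() + lifetime_seconds, usegmt=True)

# Example: emit the header line a server would send alongside the data.
print("Expires: " + expires_header())
```

[A static dataset could use a much longer lifetime; the point is that any
finite value at all allows downstream caches to absorb crawler traffic.]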

Dieter Fensel
Director STI Innsbruck, University of Innsbruck, Austria
phone: +43-512-507-6488/5, fax: +43-512-507-9872
Received on Tuesday, 21 June 2011 18:23:50 UTC
