Re: Think before you publish large datasets (was: Re: Think before you write Semantic Web crawlers)

-1.
Obviously it is not acceptable to kill the web servers of small shops
for the sake of academic experiments.

At 02:29 PM 6/21/2011, Andreas Harth wrote:
>Dear Martin,
>
>I agree with you that software accessing large portions of the web
>should adhere to basic principles such as robots.txt [a minimal check
>is sketched after the quoted message].
>
>However, I wonder why you publish large datasets and then complain when
>people actually use the data.
>
>If you provide a site with millions of triples, your infrastructure should
>scale beyond "I have clicked on a few links and the server seems to be
>doing something".  You should set the HTTP Expires header to leverage the
>widely deployed HTTP caches [a sketch follows the quoted message].  You
>should have stable URIs.  Also, you should configure your servers to shield
>them from both mad crawlers and DoS attacks (see, e.g., [1]).
>
>Publishing millions of triples is slightly more complex than publishing your
>personal homepage.
>
>Best regards,
>Andreas.
>
>[1] http://code.google.com/p/ldspider/wiki/ServerConfig
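
To make the robots.txt point concrete, here is a minimal sketch of a
polite fetch in Python using only the standard library (the function
name, user agent string, and delay value are illustrative, not taken
from any existing crawler):

    import time
    import urllib.robotparser
    from urllib.parse import urlparse
    from urllib.request import Request, urlopen

    def polite_fetch(url, agent="example-crawler", delay=1.0):
        # Consult the site's robots.txt before fetching anything.
        base = urlparse(url)
        parser = urllib.robotparser.RobotFileParser(
            "%s://%s/robots.txt" % (base.scheme, base.netloc))
        parser.read()
        if not parser.can_fetch(agent, url):
            return None  # the site disallows this URL for our agent
        time.sleep(delay)  # simple per-request rate limit
        return urlopen(Request(url, headers={"User-Agent": agent})).read()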
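
And for the Expires advice, a minimal sketch of the caching headers a
publisher's server could send, again assuming Python's standard library
(the one-day max_age default is an illustrative choice, not a
recommendation from [1]):

    import time
    from email.utils import formatdate

    def cache_headers(max_age=86400):
        # Tell downstream HTTP caches they may reuse the response
        # for max_age seconds (one day here).
        return {
            "Cache-Control": "public, max-age=%d" % max_age,
            "Expires": formatdate(time.time() + max_age, usegmt=True),
        }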

-- 
Dieter Fensel
Director STI Innsbruck, University of Innsbruck, Austria
http://www.sti-innsbruck.at/
phone: +43-512-507-6488/5, fax: +43-512-507-9872

Received on Tuesday, 21 June 2011 18:23:50 UTC