
Re: Think before you publish large datasets (was: Re: Think before you write Semantic Web crawlers)

From: Dieter Fensel <dieter.fensel@sti2.at>
Date: Tue, 21 Jun 2011 20:22:16 +0200
To: Andreas Harth <harth@kit.edu>, public-lod@w3.org
Message-ID: <E1QZ5c1-000114-HO@aji.keio.w3.org>
Obviously it is not useful to kill the web servers of small shops for the
sake of academic experiments.

At 02:29 PM 6/21/2011, Andreas Harth wrote:
>Dear Martin,
>I agree with you that software accessing large portions of the web
>should adhere to basic principles (such as respecting robots.txt).
>However, I wonder why you publish large datasets and then complain when
>people actually use the data.
>If you provide a site with millions of triples, your infrastructure should
>scale beyond "I have clicked on a few links and the server seems to be
>doing something".  You should set the HTTP Expires header to leverage the
>widely deployed HTTP caches.  You should have stable URIs.  Also, you should
>configure your servers to shield them from both misbehaving crawlers and
>DoS attacks (see e.g. [1]).
>Publishing millions of triples is slightly more complex than publishing your
>personal homepage.
>Best regards,
>[1] http://code.google.com/p/ldspider/wiki/ServerConfig
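[Editorial note: the Expires header Andreas mentions carries an RFC 1123
HTTP-date some cache lifetime in the future, which lets intermediary HTTP
caches serve the triples without re-hitting the origin server. A minimal
sketch in Python; the one-hour lifetime is an arbitrary illustrative choice,
not a recommendation from the thread:]

```python
import time
from email.utils import formatdate

def expires_header(lifetime_seconds=3600):
    # Build an RFC 1123 HTTP-date one cache lifetime in the future,
    # suitable as the value of an HTTP "Expires" response header.
    # usegmt=True yields the required "GMT" zone designator.
    return formatdate(time.time() + lifetime_seconds, usegmt=True)

# Example: emit the header line a server would send alongside the data.
print("Expires: " + expires_header())
```

[A static dataset could use a much longer lifetime; the point is that any
finite value at all allows downstream caches to absorb crawler traffic.]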

Dieter Fensel
Director STI Innsbruck, University of Innsbruck, Austria
phone: +43-512-507-6488/5, fax: +43-512-507-9872
Received on Tuesday, 21 June 2011 18:23:50 UTC
