Think before you publish large datasets (was: Re: Think before you write Semantic Web crawlers)

Dear Martin,

I agree with you that software accessing large portions of the web
should adhere to basic principles (such as respecting robots.txt).

However, I wonder why you publish large datasets and then complain when
people actually use the data.

If you provide a site with millions of triples, your infrastructure should
scale beyond "I have clicked on a few links and the server seems to be
doing something".  You should set the HTTP Expires header to leverage the
widely deployed HTTP caches.  You should have stable URIs.  Also, you should
configure your servers to shield them from both misbehaving crawlers and DoS
attacks (see e.g., [1]).
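
For illustration only, here is a minimal Python (WSGI) sketch of the two
measures above: sending Expires/Cache-Control headers so downstream HTTP
caches can absorb repeated requests, and a crude per-IP throttle that asks
overly aggressive clients to back off.  The names and numbers
(CACHE_SECONDS, REQUESTS_PER_MINUTE, the placeholder triple, port 8000) are
my own assumptions; in practice you would configure this at the web server
or caching proxy rather than in application code.

  import time
  from wsgiref.simple_server import make_server

  CACHE_SECONDS = 86400         # assumed: one day of staleness is acceptable
  REQUESTS_PER_MINUTE = 60      # assumed per-client request budget
  _recent = {}                  # client IP -> timestamps of recent requests

  def app(environ, start_response):
      ip = environ.get('REMOTE_ADDR', 'unknown')
      now = time.time()
      hits = [t for t in _recent.get(ip, []) if now - t < 60]
      hits.append(now)
      _recent[ip] = hits

      if len(hits) > REQUESTS_PER_MINUTE:
          # Ask aggressive crawlers to slow down instead of melting down.
          start_response('503 Service Unavailable',
                         [('Retry-After', '60'),
                          ('Content-Type', 'text/plain')])
          return [b'Too many requests, please slow down.\n']

      expires = time.strftime('%a, %d %b %Y %H:%M:%S GMT',
                              time.gmtime(now + CACHE_SECONDS))
      start_response('200 OK',
                     [('Content-Type', 'text/turtle'),
                      ('Cache-Control', 'public, max-age=%d' % CACHE_SECONDS),
                      ('Expires', expires)])
      # A real deployment would serve RDF from the triple store;
      # this just returns a placeholder triple.
      return [b'<http://example.org/s> <http://example.org/p> "o" .\n']

  if __name__ == '__main__':
      make_server('', 8000, app).serve_forever()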

Publishing millions of triples is slightly more complex than publishing your
personal homepage.

Best regards,
Andreas.

[1] http://code.google.com/p/ldspider/wiki/ServerConfig
