
Think before you publish large datasets (was: Re: Think before you write Semantic Web crawlers)

From: Andreas Harth <andreas@harth.org>
Date: Tue, 21 Jun 2011 14:29:00 +0200
Message-ID: <4E008E8C.2060707@harth.org>
To: public-lod@w3.org

Dear Martin,

I agree with you that software accessing large portions of the web
should adhere to basic principles (such as respecting robots.txt).
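
As an illustration of what I mean by basic principles (a minimal sketch in
Python, not LDSpider's actual behaviour; the URLs, agent name and delay are
made-up example values), a crawler should at least check robots.txt and
pause between requests:

import time
import urllib.robotparser
import urllib.request

USER_AGENT = "example-ld-crawler/0.1"   # hypothetical agent name
DELAY = 2.0                             # example politeness delay in seconds

# Fetch and parse the site's robots.txt once
rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://example.org/robots.txt")
rp.read()

for uri in ("http://example.org/resource/1", "http://example.org/resource/2"):
    if not rp.can_fetch(USER_AGENT, uri):
        continue                        # robots.txt disallows this URI
    req = urllib.request.Request(uri, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        data = resp.read()              # process the RDF here
    time.sleep(DELAY)                   # be gentle to the server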

However, I wonder why you publish large datasets and then complain when
people actually use the data.

If you provide a site with millions of triples, your infrastructure should
scale beyond "I have clicked on a few links and the server seems to be
doing something".  You should set the HTTP Expires header to leverage the
widely deployed HTTP caches.  You should have stable URIs.  Also, you should
configure your servers to shield them from both mad crawlers and DoS
attacks (see e.g. [1]).
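
To make the Expires point concrete, here is a minimal sketch (my own
illustration, assuming Python's standard library; the file name, media type
and one-day lifetime are made-up values) of serving a dump with caching
headers that intermediate HTTP caches can use:

import time
from email.utils import formatdate
from http.server import BaseHTTPRequestHandler, HTTPServer

CACHE_TTL = 24 * 60 * 60  # example lifetime: one day

class DumpHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            with open("dataset.nt", "rb") as f:   # hypothetical dump file
                body = f.read()
        except FileNotFoundError:
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/n-triples")
        # Cache-Control and Expires let proxies and clients reuse the
        # response instead of hitting the server for every request.
        self.send_header("Cache-Control", "public, max-age=%d" % CACHE_TTL)
        self.send_header("Expires", formatdate(time.time() + CACHE_TTL, usegmt=True))
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8000), DumpHandler).serve_forever()

With headers like these in place, a well-behaved crawler (or any HTTP cache
in between) can reuse responses instead of re-downloading millions of
triples on every visit.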

Publishing millions of triples is slightly more complex than publishing your
personal homepage.

Best regards,
Andreas.

[1] http://code.google.com/p/ldspider/wiki/ServerConfig