Re: Think before you publish large datasets (was: Re: Think before you write Semantic Web crawlers) from Alan Ruttenberg on 2011-06-21 (public-lod@w3.org from June 2011)

From: Alan Ruttenberg <alanruttenberg@gmail.com>
Date: Tue, 21 Jun 2011 14:58:01 -0400
To: David Wood <david@3roundstones.com>
Cc: Dieter Fensel <dieter.fensel@sti2.at>, Andreas Harth <harth@kit.edu>, public-lod@w3.org
Message-ID: <BANLkTikAMaj4_g0N31en6EwnDMaPi=5z9w@mail.gmail.com>

In the words of the great Al Franken: "It's easier to put on slippers
than to carpet the world".
http://www.quotationspage.com/quotes/Al_Franken/

While I don't support poorly written software, it's probably a good
idea to publish on your web site some recipes for defenses against
poor spidering. We've been bitten even by standard spiders when we
accidentally left certain URLs open in our MediaWiki installations.
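
One possible recipe of that kind, as a minimal sketch rather than
anything taken from this thread: a robots.txt that closes off dynamic
wiki URLs, checked with Python's standard urllib.robotparser. The
host example.org, the /w/ and /wiki/ paths, the "ExampleBot" agent
name and the 10-second delay are all assumptions for illustration.
It only helps against spiders that honor robots.txt, and Crawl-delay
is a widely recognized extension, not part of the core protocol.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a wiki: article pages under /wiki/ stay open,
# while script-generated pages under /w/ (history, edit, export) are closed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /w/
Crawl-delay: 10
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str, agent: str = "ExampleBot") -> bool:
    """Return True if a compliant crawler named `agent` may fetch `url`."""
    return parser.can_fetch(agent, url)


parser = build_parser(ROBOTS_TXT)

# Illustrative URLs on an assumed host.
article_url = "http://example.org/wiki/Main_Page"
history_url = "http://example.org/w/index.php?title=Main_Page&action=history"

article_ok = is_allowed(parser, article_url)
history_ok = is_allowed(parser, history_url)
delay = parser.crawl_delay("ExampleBot")
```

Running the same check against the live file before deploying it is
a cheap way to notice URLs that were left open by accident.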

Needless to say, your remark about schema.org is probably not going to
help get your message listened to ;-)

-Alan
"I’m good enough, I’m smart enough, and dog-gone it, people like me."

On Tue, Jun 21, 2011 at 2:47 PM, David Wood <david@3roundstones.com> wrote:
> Concur. Small companies, too, are sometimes surprised by large EC2
> invoices. If people are *using* your data, that's good. If poorly
> behaved bots are simply costing you money because their creators can't
> be bothered to support the robot exclusion protocol, that's bad.
>
> Regards,
> Dave
>
> On Jun 21, 2011, at 14:22, Dieter Fensel wrote:
>
>> -1.
>> Obviously it is not useful to kill the web server of small shops due to
>> academic experiments.
>>
>> At 02:29 PM 6/21/2011, Andreas Harth wrote:
>>> Dear Martin,
>>>
>>> I agree with you that software accessing large portions of the web
>>> should adhere to basic principles (such as robots.txt).
>>>
>>> However, I wonder why you publish large datasets and then complain when
>>> people actually use the data.
>>>
>>> If you provide a site with millions of triples, your infrastructure should
>>> scale beyond "I have clicked on a few links and the server seems to be
>>> doing something".  You should set the HTTP Expires header to leverage the
>>> widely deployed HTTP caches.  You should have stable URIs.  Also, you should
>>> configure your servers to shield them from both mad crawlers and DoS
>>> attacks (see, e.g., [1]).
>>>
>>> Publishing millions of triples is slightly more complex than publishing your
>>> personal homepage.
>>>
>>> Best regards,
>>> Andreas.
>>>
>>> [1] http://code.google.com/p/ldspider/wiki/ServerConfig
>>
>> --
>> Dieter Fensel
>> Director STI Innsbruck, University of Innsbruck, Austria
>> http://www.sti-innsbruck.at/
>> phone: +43-512-507-6488/5, fax: +43-512-507-9872
>>
>>

Received on Tuesday, 21 June 2011 18:58:49 UTC