Re: Think before you write Semantic Web crawlers from Steve Harris on 2011-06-22 (semantic-web@w3.org from June 2011)

From: Steve Harris <steve.harris@garlik.com>
Date: Wed, 22 Jun 2011 15:57:06 +0100
To: Hugh Glaser <hg@ecs.soton.ac.uk>
Cc: Christopher Gutteridge <cjg@ecs.soton.ac.uk>, Martin Hepp <martin.hepp@ebusiness-unibw.org>, Daniel Herzig <herzig@kit.edu>, "semantic-web@w3.org" <semantic-web@w3.org>, "public-lod@w3.org" <public-lod@w3.org>
Message-Id: <83EE1DA2-BA4B-43A3-B179-F21D7BB96F46@garlik.com>

Yes, exactly.

I think that the problem is at least partly (and I say this as an ex-academic) that few people in academia have the slightest idea how much it costs to run a farm of servers in the Real World™.

From the crawler's point of view, they're trying to get as much data as possible in as short a time as possible, but they don't realise that the poor guy at the other end just got his 95th percentile shot through the roof, and now has a several-thousand-dollar bandwidth bill heading his way.
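
To put rough numbers on the 95th percentile point: under burstable billing the provider typically samples throughput every five minutes, discards the top 5% of the month's samples, and bills on the highest one left. A minimal sketch, assuming 5-minute samples and a nearest-rank percentile; the traffic figures are invented for illustration and `percentile_95` is a hypothetical helper, not any provider's actual formula.

```python
def percentile_95(samples_mbps):
    """Nearest-rank 95th percentile: the highest sample left after
    discarding the top 5% of samples."""
    if not samples_mbps:
        raise ValueError("no samples")
    ordered = sorted(samples_mbps)
    rank = (95 * len(ordered) + 99) // 100   # ceil(0.95 * n), 1-based
    return ordered[rank - 1]

SAMPLES_PER_DAY = 24 * 60 // 5        # 288 five-minute samples per day
MONTH = 30 * SAMPLES_PER_DAY          # 8640 samples in a 30-day month

# Hypothetical site: a steady 10 Mbit/s all month.
quiet_month = [10.0] * MONTH

# Same site, but a crawler saturates a 100 Mbit/s link for two days.
burst = 2 * SAMPLES_PER_DAY           # 576 samples, about 6.7% of the month
crawled_month = [10.0] * (MONTH - burst) + [100.0] * burst

print(percentile_95(quiet_month))     # 10.0
print(percentile_95(crawled_month))   # 100.0
```

With these made-up figures, a burst lasting under 5% of the month (about 36 hours) is discarded and costs nothing extra, while two days of crawling raises the billable rate tenfold.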

You can cap bandwidth, but that then might annoy paying customers, which is clearly not good.

- Steve

On 2011-06-22, at 12:54, Hugh Glaser wrote:

> Hi Chris.
> One way to do the caching really efficiently:
> http://lists.w3.org/Archives/Public/semantic-web/2007Jun/0012.html
> Which is what rkb has always done.
> But of course caching does not solve the problem of one bad crawler.
> It actually makes it worse.
> You add a cache write cost to the query, without a significant probability of a future cache hit. And increase disk usage.
> 
> Hugh
> 
> ----- Reply message -----
> From: "Christopher Gutteridge" <cjg@ecs.soton.ac.uk>
> To: "Martin Hepp" <martin.hepp@ebusiness-unibw.org>
> Cc: "Daniel Herzig" <herzig@kit.edu>, "semantic-web@w3.org" <semantic-web@w3.org>, "public-lod@w3.org" <public-lod@w3.org>
> Subject: Think before you write Semantic Web crawlers
> Date: Wed, Jun 22, 2011 9:18 am
> 
> 
> 
> The difference between these two scenarios is that there's almost no CPU involvement in serving the PDF file, but naive RDF sites use lots of cycles to generate the response to a query for an RDF document.
> 
> Right now queries to data.southampton.ac.uk (e.g. http://data.southampton.ac.uk/products-and-services/CupCake.rdf ) are made live, but this is not efficient. My colleague, Dave Challis, has prepared a SPARQL endpoint that caches results, which we can turn on if the load gets too high; that should at least mitigate the problem. Very few datasets change in a 24-hour period.
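
The caching idea in the paragraph above can be sketched as a small time-to-live cache in front of an expensive query. This is only an illustration under assumptions: the 24-hour lifetime comes from the remark above, while `TTLCache`, `run_query`, and the injectable clock are hypothetical names, not the Southampton implementation.

```python
import time

class TTLCache:
    """Cache results of an expensive function for a fixed lifetime."""

    def __init__(self, fetch, ttl_seconds=24 * 60 * 60, clock=time.monotonic):
        self._fetch = fetch          # expensive function: key -> value
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the cache can be tested
        self._entries = {}           # key -> (expiry_time, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]          # fresh hit: no query is run
        value = self._fetch(key)     # miss or stale: pay the full cost
        self._entries[key] = (now + self._ttl, value)
        return value

calls = []

def run_query(uri):
    """Stand-in for generating an RDF document from a live query."""
    calls.append(uri)
    return f"<rdf for {uri}>"

cache = TTLCache(run_query)
cache.get("products-and-services/CupCake.rdf")
cache.get("products-and-services/CupCake.rdf")
print(len(calls))  # 1: the second request was served from the cache
```

This also shows the limit Hugh points out above: a crawler that requests every URI exactly once never gets a hit, so each request pays for the query plus a cache write.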
> 
> Martin Hepp wrote:
> 
> Hi Daniel,
> Thanks for the link! I will relay this to relevant site-owners.
> 
> However, I still challenge Andreas' statement that the site-owners are to blame for publishing large amounts of data on small servers.
> 
> One can publish 10,000 PDF documents on a tiny server without being hit by DoS-style crazy crawlers. Why should the same not hold if I publish RDF?
> 
> But for sure, it is necessary to advise all publishers of large RDF datasets to protect themselves against hungry crawlers and actual DoS attacks.
> 
> Imagine if a large site was brought down by a botnet that is exploiting Semantic Sitemap information for DoS attacks, focussing on the large dump files.
> This could end LOD experiments for that site.
> 
> 
> Best
> 
> Martin
> 
> 
> On Jun 21, 2011, at 10:24 AM, Daniel Herzig wrote:
> 
> 
> 
> Hi Martin,
> 
> Have you tried putting Squid [1] in front of your servers as a reverse proxy and using delay pools [2] to catch hungry crawlers?
> 
> Cheers,
> Daniel
> 
> [1] http://www.squid-cache.org/
> [2] http://wiki.squid-cache.org/Features/DelayPools
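
Delay pools [2] work with bandwidth buckets that refill at a fixed rate and are drained by traffic. The general idea can be sketched as a per-client token bucket. This illustrates the technique only; it is not Squid's implementation or its configuration syntax, and the class name, rates, and client address are assumptions.

```python
import time

class TokenBucket:
    """Allow `rate` bytes per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity      # start with a full bucket
        self._last = clock()

    def try_consume(self, nbytes):
        """Return True and deduct tokens if nbytes may be sent now."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the time that has passed, never beyond capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if nbytes <= self._tokens:
            self._tokens -= nbytes
            return True
        return False

buckets = {}  # one bucket per client address

def allow(client_ip, nbytes, rate=8_000, capacity=64_000):
    """Hypothetical per-client limit: 8 kB/s sustained, 64 kB burst."""
    bucket = buckets.get(client_ip)
    if bucket is None:
        bucket = buckets[client_ip] = TokenBucket(rate, capacity)
    return bucket.try_consume(nbytes)

print(allow("192.0.2.1", 64_000))  # True: the initial burst fits a full bucket
print(allow("192.0.2.1", 64_000))  # False: the bucket has had no time to refill
```

Because each client address gets its own bucket, one greedy crawler is slowed down without capping everyone else.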
> 
> On 21.06.2011, at 09:49, Martin Hepp wrote:
> 
> 
> 
> Hi all:
> 
> For the third time in a few weeks, we had massive complaints from site-owners that Semantic Web crawlers from Universities visited their sites in a way close to a denial-of-service attack, i.e., crawling data with maximum bandwidth in a parallelized approach.
> 
> It's clear that a single, stupidly written crawler script, run from a powerful University network, can quickly create terrible traffic load.
> 
> Many of the scripts we saw
> 
> - ignored robots.txt,
> - ignored clear crawling speed limitations in robots.txt,
> - did not identify themselves properly in the HTTP request header or lacked contact information therein,
> - used no mechanisms at all for limiting the default crawling speed and re-crawling delays.
> 
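
The points in the list above can be sketched with Python's standard library. This is a hedged illustration, not a complete crawler: it checks robots.txt, honours a Crawl-delay (a common but non-standard robots.txt extension), identifies itself with contact details, and falls back to a default delay. It does not cover re-crawl scheduling. The crawler name, contact addresses, 5-second default, and `PoliteFetcher` class are all hypothetical, and robots.txt is passed in as text so the sketch runs without network access; a real crawler would fetch it from the site first.

```python
import time
import urllib.request
import urllib.robotparser

# Hypothetical identity: name the crawler and give site owners a contact.
USER_AGENT = ("ExampleLODCrawler/0.1 "
              "(+http://example.org/crawler; mailto:crawler-admin@example.org)")
DEFAULT_DELAY = 5.0  # seconds between requests when robots.txt sets none

class PoliteFetcher:
    """Sequential fetcher for one host that honours robots.txt."""

    def __init__(self, robots_txt, user_agent=USER_AGENT,
                 clock=time.monotonic, sleep=time.sleep):
        self.user_agent = user_agent
        self._rules = urllib.robotparser.RobotFileParser()
        self._rules.parse(robots_txt.splitlines())
        self._clock, self._sleep = clock, sleep
        self._next_allowed = 0.0

    def delay(self):
        """Crawl-delay from robots.txt, or a conservative default."""
        declared = self._rules.crawl_delay(self.user_agent)
        return float(declared) if declared is not None else DEFAULT_DELAY

    def allowed(self, url):
        return self._rules.can_fetch(self.user_agent, url)

    def wait_turn(self):
        """Block until the delay since the previous request has passed."""
        now = self._clock()
        if now < self._next_allowed:
            self._sleep(self._next_allowed - now)
        self._next_allowed = self._clock() + self.delay()

    def fetch(self, url):
        """Fetch one URL, or return None if robots.txt disallows it."""
        if not self.allowed(url):
            return None
        self.wait_turn()
        request = urllib.request.Request(
            url, headers={"User-Agent": self.user_agent})
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.read()

robots_txt = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""
fetcher = PoliteFetcher(robots_txt)
print(fetcher.allowed("http://example.org/private/x.rdf"))  # False
print(fetcher.delay())                                      # 10.0
```

Requests are made one at a time per host on purpose: the parallelised, full-bandwidth approach is exactly what the complaints were about.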
> This irresponsible behavior can be the final reason for site-owners to say farewell to academic/W3C-sponsored semantic technology.
> 
> So please, please - advise all of your colleagues and students NOT to write simple crawler scripts for the billion triples challenge or anything else without familiarizing themselves with the state of the art in "friendly crawling".
> 
> Best wishes
> 
> Martin Hepp
> 
> 
> --
> Christopher Gutteridge -- http://id.ecs.soton.ac.uk/person/1248
> 
> You should read the ECS Web Team blog: http://blogs.ecs.soton.ac.uk/webteam/
> 
> 

-- 
Steve Harris, CTO, Garlik Limited
1-3 Halford Road, Richmond, TW10 6AW, UK
+44 20 8439 8203  http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD

Received on Wednesday, 22 June 2011 14:57:36 UTC