- From: Dave Challis <dsc@ecs.soton.ac.uk>
- Date: Wed, 22 Jun 2011 16:51:52 +0100
- To: public-lod@w3.org
On 22/06/11 16:05, Kingsley Idehen wrote:
> On 6/22/11 3:57 PM, Steve Harris wrote:
>> Yes, exactly.
>>
>> I think that the problem is at least partly (and I say this as an
>> ex-academic) that few people in academia have the slightest idea how
>> much it costs to run a farm of servers in the Real World™.
>>
>> From the point of view of the crawler they're trying to get as much
>> data as possible in as short a time as possible, but don't realise
>> that the poor guy at the other end just got his 95th percentile shot
>> through the roof, and now has a several thousand dollar bandwidth
>> bill heading his way.
>>
>> You can cap bandwidth, but that then might annoy paying customers,
>> which is clearly not good.
>
> Yes, so we need QoS algorithms or heuristics capable of fine-grained
> partitioning re. Who can do What, When, and Where :-)
>
> Kingsley

There are plenty of these around when it comes to web traffic in
general. For Apache, I can think of ModSecurity
(http://www.modsecurity.org/) and mod_evasive
(http://www.zdziarski.com/blog/?page_id=442). Both of these will look
at traffic patterns and dynamically blacklist as needed.

ModSecurity also allows custom rules to be written depending on
GET/POST content, so it should be perfectly feasible to set up rules
based on estimated/actual query cost (e.g. blacklist a client if it
makes more than X requests per Y minutes which return more than Z
triples).

I can't see any reason why a hybrid approach couldn't be used either,
e.g. apply the rules to unauthenticated traffic, and auto-whitelist
clients identifying themselves via WebID. (A rough sketch of the idea
follows below.)

--
Dave Challis
dsc@ecs.soton.ac.uk
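As a rough illustration of that kind of rule, here is a minimal sketch in Python (not ModSecurity's own rule language) of the heuristic described above: a client that repeatedly pulls back expensive results within a time window gets blacklisted, while clients already whitelisted via WebID bypass the check. The threshold values and the whitelist hook are placeholders, not anything a particular module ships with.

```python
import time
from collections import defaultdict, deque

MAX_REQUESTS = 100    # X: expensive requests allowed per window
WINDOW_SECS = 600     # Y: window length in seconds (10 minutes)
MAX_TRIPLES = 10000   # Z: triples per response considered "expensive"

# client_ip -> timestamps of recent expensive requests
expensive_requests = defaultdict(deque)
blacklist = set()
webid_whitelist = set()  # populated by whatever performs WebID verification


def record_response(client_ip, triples_returned, now=None):
    """Record one response; return False if the client should now be blocked."""
    now = time.time() if now is None else now

    if client_ip in webid_whitelist:   # authenticated clients skip the rule
        return True
    if client_ip in blacklist:
        return False

    if triples_returned > MAX_TRIPLES:
        window = expensive_requests[client_ip]
        window.append(now)
        # Drop entries that have fallen out of the sliding window.
        while window and window[0] < now - WINDOW_SECS:
            window.popleft()
        if len(window) > MAX_REQUESTS:
            blacklist.add(client_ip)
            return False
    return True
```

The same logic could equally be expressed as ModSecurity rules or sit in a small proxy in front of the endpoint; the shape of the heuristic is the same either way.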
Received on Wednesday, 22 June 2011 20:52:41 UTC