Re: Think before you write Semantic Web crawlers from Dave Challis on 2011-06-22 (public-lod@w3.org from June 2011)

From: Dave Challis <dsc@ecs.soton.ac.uk>
Date: Wed, 22 Jun 2011 16:51:52 +0100
To: public-lod@w3.org
Message-ID: <EMEW3|c0ab373e54b15e47b43daaccd73b751cn5LGok03dsc|ecs.soton.ac.uk|4E020F98.3030>

On 22/06/11 16:05, Kingsley Idehen wrote:
> On 6/22/11 3:57 PM, Steve Harris wrote:
>> Yes, exactly.
>>
>> I think that the problem is at least partly (and I say this as an
>> ex-academic) that few people in academia have the slightest idea how
>> much it costs to run a farm of servers in the Real World™.
>>
>> From the crawler's point of view, they're trying to get as much
>> data as possible in as short a time as possible, but don't realise that
>> the poor guy at the other end just got his 95th percentile shot
>> through the roof, and now has a several thousand dollar bandwidth bill
>> heading his way.
>>
>> You can cap bandwidth, but that then might annoy paying customers,
>> which is clearly not good.
>
> Yes, so we need QoS algorithms or heuristics capable of fine-grained
> partitioning re. Who can do What, When, and Where :-)
>
> Kingsley

There are plenty of these around when it comes to web traffic in
general.  For Apache, I can think of ModSecurity
(http://www.modsecurity.org/) and mod_evasive
(http://www.zdziarski.com/blog/?page_id=442).

Both of these will look at traffic patterns and dynamically blacklist as 
needed.

ModSecurity also allows custom rules to be written based on
GET/POST content, so it should be perfectly feasible to set up rules
based on estimated/actual query cost (e.g. blacklist a client if it makes
> X requests per Y minutes which each return > Z triples).
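
As a rough sketch of the kind of cost-based rule I mean, here it is in
Python rather than in actual ModSecurity rule syntax.  The class name,
method name and thresholds are all made up for illustration, and it
assumes the server can report how many triples each response returned:

```python
import time
from collections import defaultdict, deque


class CostBlacklist:
    """Hypothetical rule: blacklist a client that makes more than
    max_requests (X) requests within window_seconds (Y), counting only
    requests that returned more than triple_threshold (Z) triples."""

    def __init__(self, max_requests, window_seconds, triple_threshold):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.triple_threshold = triple_threshold
        # client id -> timestamps of its recent expensive requests
        self.expensive = defaultdict(deque)
        self.blacklisted = set()

    def record(self, client, triples_returned, now=None):
        """Record one completed request.

        Returns True if the client is blacklisted after this request."""
        now = time.time() if now is None else now
        if client in self.blacklisted:
            return True
        if triples_returned > self.triple_threshold:
            hits = self.expensive[client]
            hits.append(now)
            # Forget expensive requests that fell out of the window.
            while hits and now - hits[0] > self.window_seconds:
                hits.popleft()
            if len(hits) > self.max_requests:
                self.blacklisted.add(client)
                return True
        return False
```

Cheap requests never count against a client here, so a polite crawler
doing lots of small lookups is left alone; only repeated large result
sets within the window trigger the blacklist.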

I can't see any reason why a hybrid approach couldn't be used, e.g. apply
the rules to unauthenticated traffic only, and auto-whitelist clients that
identify themselves via WebID.
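
Roughly, the hybrid decision could look like the Python sketch below.
The function and parameter names are hypothetical, and it assumes the
WebID has already been verified elsewhere (e.g. during the TLS
handshake); no actual WebID verification is shown:

```python
from typing import Callable, Optional


def admit(client_ip: str,
          verified_webid: Optional[str],
          is_blacklisted: Callable[[str], bool]) -> bool:
    """Decide whether to serve a request.

    verified_webid is the client's WebID URI if it authenticated
    successfully, otherwise None (assumption: verification happens
    before this function is called)."""
    if verified_webid is not None:
        # Identified clients are auto-whitelisted and skip the rules.
        return True
    # Anonymous traffic is subject to the blacklist rules.
    return not is_blacklisted(client_ip)
```

Usage would be something like
admit(ip, webid, lambda c: c in blocked_ips), where blocked_ips is
whatever set the traffic rules maintain.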

-- 
Dave Challis
dsc@ecs.soton.ac.uk

Received on Wednesday, 22 June 2011 20:52:41 UTC