Re: Think before you write Semantic Web crawlers

On 6/22/11 4:51 PM, Dave Challis wrote:
> On 22/06/11 16:05, Kingsley Idehen wrote:
>> On 6/22/11 3:57 PM, Steve Harris wrote:
>>> Yes, exactly.
>>> I think that the problem is at least partly (and I say this as an
>>> ex-academic) that few people in academia have the slightest idea how
>>> much it costs to run a farm of servers in the Real World™.
>>> From the point of view of the crawler they're trying to get as much
>>> data as possible in a short a time as possible, but don't realise that
>>> the poor guy at the other end just got his 95th percentile shot
>>> through the roof, and now has a several thousand dollar bandwidth bill
>>> heading his way.
>>> You can cap bandwidth, but that then might annoy paying customers,
>>> which is clearly not good.
>> Yes, so we need QoS algorithms or heuristics capable of fine-grained
>> partitioning re. Who can do What, When, and Where :-)
>> Kingsley
> There are plenty of these around when it comes to web traffic in 
> general.  For Apache, I can think of ModSecurity and mod_evasive.
> Both of these will look at traffic patterns and dynamically blacklist 
> as needed.
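
The dynamic blacklisting these modules perform can be sketched roughly as follows. This is a minimal illustration, not mod_evasive's actual implementation, and the thresholds are made-up placeholders:

```python
import time
from collections import defaultdict, deque

# Illustrative thresholds only -- not mod_evasive's defaults.
MAX_REQUESTS = 50       # requests allowed per window
WINDOW_SECONDS = 1.0    # sliding window length
BLOCK_SECONDS = 10.0    # how long an offender stays blacklisted

_hits = defaultdict(deque)   # ip -> timestamps of recent requests
_blocked_until = {}          # ip -> time at which the block expires

def allow(ip, now=None):
    """Return True if this request from `ip` should be served."""
    now = time.monotonic() if now is None else now
    if _blocked_until.get(ip, 0) > now:
        return False                     # still blacklisted
    hits = _hits[ip]
    hits.append(now)
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()                   # drop requests outside the window
    if len(hits) > MAX_REQUESTS:
        _blocked_until[ip] = now + BLOCK_SECONDS
        return False
    return True
```

Note that the key is the IP address alone, which is exactly the granularity problem raised below: every client behind that address shares one bucket.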

How do they deal with "Who" without throwing the baby out with the 
bathwater re. Linked Data?

An innocent Linked Data consumer triggers a transitive crawl, and all 
other visitors from that IP end up on a blacklist? Nobody meant any 
harm. In the RDBMS realm, would it be reasonable to take either of the 
following actions:

1. Cut off marketing because someone triggered SELECT * FROM Customers 
as part of MS Query or MS Access usage?
2. Cut off sales and/or marketing because, while trying to grok SQL 
joins, they generated a lot of Cartesian products?
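
The Cartesian-product hazard in point 2 is easy to demonstrate with a tiny SQLite sketch (the table names and row counts here are made up for illustration):

```python
import sqlite3

# Hypothetical tables, assumed for illustration only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Customers (id INTEGER, name TEXT)")
cur.execute("CREATE TABLE Orders (id INTEGER, customer_id INTEGER)")
cur.executemany("INSERT INTO Customers VALUES (?, ?)",
                [(i, f"c{i}") for i in range(1000)])
cur.executemany("INSERT INTO Orders VALUES (?, ?)",
                [(i, i % 1000) for i in range(1000)])

# A forgotten join condition silently produces a Cartesian product:
# 1,000 x 1,000 = 1,000,000 rows instead of 1,000.
bad = cur.execute(
    "SELECT COUNT(*) FROM Customers, Orders").fetchone()[0]
good = cur.execute(
    "SELECT COUNT(*) FROM Customers c JOIN Orders o "
    "ON o.customer_id = c.id").fetchone()[0]
print(bad, good)  # 1000000 1000
```

An honest mistake like this looks, from the server's side, exactly like abuse; which is the point: blunt traffic rules can't tell the difference.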

You need granularity within the data access technology itself. WebID 
offers that to Linked Data. Linked Data is the evolution hitting the Web 
and redefining crawling in the process.

> ModSecurity also allows for custom rules to be written depending on 
> get/post content, so it should be perfectly feasible to set up rules 
> based on estimated/actual query cost (e.g. blacklist if client makes > 
> X requests per Y mins which return > Z triples).

How does it know about: …, for better or for worse re. QoS?
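
The cost-based rule Dave describes (blacklist if a client makes > X requests per Y minutes which return > Z triples) can be sketched as follows. X, Y, and Z here are placeholder values, not tuned recommendations:

```python
import time
from collections import defaultdict, deque

# Placeholder thresholds for the "> X requests per Y mins returning
# > Z triples" rule -- illustrative only.
X_REQUESTS = 100
Y_SECONDS = 600          # 10 minutes
Z_TRIPLES = 10_000

_expensive = defaultdict(deque)   # client -> times of expensive requests
_blacklist = set()

def record_response(client, triples_returned, now=None):
    """Call after serving a request; returns False once blacklisted."""
    if client in _blacklist:
        return False
    now = time.monotonic() if now is None else now
    if triples_returned > Z_TRIPLES:
        q = _expensive[client]
        q.append(now)
        while q and now - q[0] > Y_SECONDS:
            q.popleft()               # forget old expensive requests
        if len(q) > X_REQUESTS:
            _blacklist.add(client)
            return False
    return True
```

Unlike a plain request counter, this only penalizes clients whose responses are actually expensive, but it still keys on the client address and so inherits the "Who" problem above.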

> Can't see any reason why a hybrid approach couldn't be used, e.g. 
> apply rules to unauthenticated traffic, and auto-whitelist clients 
> identifying themselves via WebID.

Of course a hybrid system is how it has to work. WebID isn't a silver 
bullet; nothing is. Hence the need for heuristics and algorithms. WebID 
is just a critical factor, ditto Trust Logic.
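
That hybrid gate is simple to express in outline. In this sketch `verify_webid` is a hypothetical stand-in: real WebID-TLS verification dereferences the profile URI from the client certificate and checks that the published public key matches the one presented:

```python
def verify_webid(client_cert):
    """Hypothetical placeholder for real WebID-TLS verification."""
    return client_cert is not None and client_cert.get("webid") is not None

def admit(request, rate_limiter_allow):
    """Authenticated WebID clients bypass the rules; anonymous
    traffic goes through the rate limiter."""
    if verify_webid(request.get("client_cert")):
        return True                            # auto-whitelist via WebID
    return rate_limiter_allow(request["ip"])   # anonymous: apply rules
```

The design point is that identity, where offered, replaces the IP address as the unit of accountability, so one noisy crawler no longer condemns everyone behind the same NAT.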



Kingsley Idehen	
President & CEO
OpenLink Software
Twitter/ kidehen

Received on Wednesday, 22 June 2011 22:17:15 UTC