Re: Think before you write Semantic Web crawlers

On 23 Jun 2011, at 13:13, Michael Brunnbauer wrote:

> 
> On Thu, Jun 23, 2011 at 11:32:43AM +0100, Kingsley Idehen wrote:
>>> config = {
>>> 'Googlebot':['googlebot.com'],
>>> 'Mediapartners-Google':['googlebot.com'],
>>> 'msnbot':['live.com','msn.com','bing.com'],
>>> 'bingbot':['live.com','msn.com','bing.com'],
>>> 'Yahoo! Slurp':['yahoo.com','yahoo.net']
>>> }
>> How does that deal with a DoS query inadvertently or deliberately 
>> generated by a SPARQL user agent?
> 
> It's part of the solution. It prevents countermeasures hitting the crawlers
> that are welcome.
> 
> How does WebID deal with it - except that it allows more fine grained ACLs per
> person/agent instead of DNS domain ? WebID is a cool thing and maybe crawlers
> will use it in the future but Martin needs solutions right now.
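
Just to spell out how I read the whitelist above: it would presumably be applied with a forward-confirmed reverse DNS check, roughly along these lines (a rough sketch of my own in Python, not Michael's actual code):

import socket

config = {
    'Googlebot': ['googlebot.com'],
    'Mediapartners-Google': ['googlebot.com'],
    'msnbot': ['live.com', 'msn.com', 'bing.com'],
    'bingbot': ['live.com', 'msn.com', 'bing.com'],
    'Yahoo! Slurp': ['yahoo.com', 'yahoo.net']
}

def is_welcome_crawler(user_agent, ip):
    # True only if the User-Agent claims to be a known crawler and the
    # client IP really resolves back into one of the whitelisted domains.
    for bot, domains in config.items():
        if bot not in user_agent:
            continue
        try:
            host = socket.gethostbyaddr(ip)[0]              # reverse lookup
            if ip not in socket.gethostbyname_ex(host)[2]:  # forward-confirm
                return False
        except socket.error:
            return False
        return any(host == d or host.endswith('.' + d) for d in domains)
    return False

That stops someone getting whitelisted simply by sending a Googlebot User-Agent string, since the IP has to resolve back into one of the listed domains.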

I'd emphasize Michael's point above: it allows *much* more fine-grained ACLs. It's the difference between a police force that would throw all gypsies into jail because it had some information suggesting that one gypsy stole something, and a police force that would find the guilty person and put only him in jail.

Not only does it allow finer-grained ACLs, but it would also allow agents to identify themselves: say, as crawlers or as end users. A crawler could quickly be guided to the relevant dump file or RSS feeds, so that it does not need to waste resources on the server. That also ties the user/crawler into linked data, which means we are applying linked data recursively to solve a linked data problem. That's the neat bit :-)
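
To make that a bit more concrete, a server could do something like the following once an agent has authenticated with its WebID (a very rough sketch in Python with rdflib; the agent-type vocabulary and the URLs are made up, nothing here is standardised):

from rdflib import Graph, Namespace, URIRef

BOT = Namespace('http://example.org/agents#')   # hypothetical vocabulary

def handle_request(webid):
    # Dereference the WebID profile the client authenticated with.
    profile = Graph()
    profile.parse(webid)

    if (URIRef(webid), BOT.agentType, BOT.Crawler) in profile:
        # Self-declared crawler: send it straight to the site-wide dump
        # instead of letting it walk every single resource.
        return 303, {'Location': 'http://example.org/dumps/all.nt.gz'}

    # Ordinary end users get the normal, per-resource response.
    return 200, {'Content-Type': 'text/turtle'}

And because the profile is itself linked data, the server can follow its nose from there to whatever else the agent publishes about itself, which is the recursive bit I mean.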

Henry


Social Web Architect
http://bblfish.net/

Received on Thursday, 23 June 2011 11:21:46 UTC