- From: Kingsley Idehen <kidehen@openlinksw.com>
- Date: Thu, 23 Jun 2011 13:32:43 +0100
- To: public-lod@w3.org
On 6/23/11 12:13 PM, Michael Brunnbauer wrote:
> re
>
> On Thu, Jun 23, 2011 at 11:32:43AM +0100, Kingsley Idehen wrote:
>>> config = {
>>>     'Googlebot':['googlebot.com'],
>>>     'Mediapartners-Google':['googlebot.com'],
>>>     'msnbot':['live.com','msn.com','bing.com'],
>>>     'bingbot':['live.com','msn.com','bing.com'],
>>>     'Yahoo! Slurp':['yahoo.com','yahoo.net']
>>> }
>> How does that deal with a DoS query inadvertently or deliberately
>> generated by a SPARQL user agent?
> It's part of the solution. It prevents countermeasures from hitting the
> crawlers that are welcome.
>
> How does WebID deal with it, except that it allows more fine-grained ACLs
> per person/agent instead of per DNS domain? WebID is a cool thing, and
> maybe crawlers will use it in the future, but Martin needs solutions right
> now.

Martin's problem isn't about right now. Yes, he used a specific example, but I can assure you it isn't about right now per se. He can blacklist offenders today, but that doesn't solve the big-picture issue. You need granularity; there is no way around it. Logic has to be put to work, and having logic within data is the key to all of this. It always has been; the WWW has finally brought these matters to the fore, in a big way.

>> Google and friends are the real problem to come; it's the inadvertent
>> SPARQL query that kicks off a transitive crawl that's going to wreak
>> havoc.

Google and friends aren't the problem, I meant to say.

> Are you talking about one agent crawling in an unfriendly way, or 10,000
> agents crawling in a friendly way but nevertheless constituting a DDoS?

I am saying: we have a new Web dimension, a data space dimension, where the WWW is now a distributed DBMS (of sorts). Thus, DBMS issues that used to be private to the enterprise are now in the public domain. A Denial of Service (DoS) can occur in a myriad of ways (deliberate or inadvertent), the most challenging being the Cartesian product scenario I referenced in an earlier post.

In the information space dimension, crawling was, and is, an activity dominated by dedicated crawlers. In the data space dimension, crawling is a natural consequence of exploring Linked Data meshes (via follow-your-nose patterns) at InterWeb scales. People will start off with a click here and there, then they'll generate some SPARQL (via user-friendly tools that generate SPARQL for them), and ultimately they'll have agents doing all of this and more, as part of a natural evolution driven by the pursuit of productivity. Walking SKOS hierarchies transitively, or putting OWL to its ultimate use (smart traversal and integration of heterogeneous data), will make this happen.

In a sense, the RDF-induced delays to Linked Data uptake could actually be a blessing in disguise, since the whole thing would have imploded on itself years ago, based on experiences of the kind unveiled by Martin. Users don't have any time or interest in an aggressively promoted WWW innovation that fails at hurdle #1 post adoption, i.e., they don't have time to wait for vendors to react and code in response to oversights associated with integral implementation issues such as:

1. Data access policies
2. Infrastructure costs.

> I think agents behaving in an unfriendly way will not be used by people
> other than their authors.

See my comments above. The Web agent is changing already. The Web can now be queried like a SQL RDBMS of yore, but in much more sophisticated fashion :-)
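To make that concrete, here is a rough, untested sketch of how a whitelist like the config dict quoted above could be enforced. The function name and the forward-confirmation step are illustration on my part, not code from Michael's mail; the usual trick is to reverse-resolve the requesting IP, check the hostname against the allowed domains, then resolve the hostname forward again to guard against spoofed PTR records:

import socket

# Crawler whitelist from Michael's mail: user-agent name -> allowed DNS domains.
config = {
    'Googlebot': ['googlebot.com'],
    'Mediapartners-Google': ['googlebot.com'],
    'msnbot': ['live.com', 'msn.com', 'bing.com'],
    'bingbot': ['live.com', 'msn.com', 'bing.com'],
    'Yahoo! Slurp': ['yahoo.com', 'yahoo.net'],
}

def is_welcome_crawler(user_agent, ip):
    # Hypothetical helper: True if the request plausibly comes from a
    # whitelisted crawler.
    domains = config.get(user_agent)
    if not domains:
        return False
    try:
        hostname = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return False
    # The reverse-resolved hostname must sit inside an allowed domain.
    if not any(hostname == d or hostname.endswith('.' + d) for d in domains):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]
    except socket.gaierror:
        return False
    # Forward lookup must map back to the original IP (anti-spoofing check).
    return ip in forward_ips

But, per my comments above, this only authenticates the crawlers you welcome; it does nothing about the inadvertently expensive query itself.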
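And here is the kind of innocent-looking query I have in mind re. the Cartesian product issue. The endpoint and class URIs below are purely illustrative; the point is that the two patterns share no variable, so the result set is the cross product of the two matches:

import urllib.parse
import urllib.request

# Illustrative endpoint; any public SPARQL endpoint has the same exposure.
ENDPOINT = 'http://dbpedia.org/sparql'

# The two triple patterns share no variable, so the solution sequence is the
# Cartesian product of the two matches: |?person bindings| x |?place bindings|
# rows, from a four-line query.
QUERY = """
SELECT ?person ?place WHERE {
  ?person a <http://xmlns.com/foaf/0.1/Person> .
  ?place  a <http://dbpedia.org/ontology/Place> .
}
"""

request = urllib.request.Request(
    ENDPOINT + '?' + urllib.parse.urlencode({'query': QUERY}),
    headers={'Accept': 'application/sparql-results+json'})
# urllib.request.urlopen(request) would fire it; deliberately not called,
# since executing this is precisely the inadvertent DoS under discussion.

A user-friendly query builder can emit something like this without the user ever seeing the SPARQL, which is why per-domain whitelists alone won't save an endpoint.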
> Regards,
>
> Michael Brunnbauer

--

Regards,

Kingsley Idehen
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen
Received on Thursday, 23 June 2011 12:33:08 UTC