
Re: Think before you write Semantic Web crawlers

From: Michael Brunnbauer <brunni@netestate.de>
Date: Thu, 23 Jun 2011 13:13:22 +0200
To: Kingsley Idehen <kidehen@openlinksw.com>
Cc: public-lod@w3.org
Message-ID: <20110623111322.GA22028@netestate.de>

re

On Thu, Jun 23, 2011 at 11:32:43AM +0100, Kingsley Idehen wrote:
> >config = {
> >'Googlebot':['googlebot.com'],
> >'Mediapartners-Google':['googlebot.com'],
> >'msnbot':['live.com','msn.com','bing.com'],
> >'bingbot':['live.com','msn.com','bing.com'],
> >'Yahoo! Slurp':['yahoo.com','yahoo.net']
> >}
> How does that deal with a DoS query inadvertently or deliberately 
> generated by a SPARQL user agent?

It's part of the solution. It keeps countermeasures from hitting the crawlers
that are welcome.
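As a rough sketch of how such a config could be applied: one common reading is
that a client claiming one of these user agents is only trusted if its IP
reverse-resolves to a host in one of the listed domains, and that host resolves
back to the same IP. The function and helper names below (is_welcome_crawler,
_reverse, _forward) are hypothetical and not taken from the original script,
and the forward lookup shown here only covers IPv4.

```python
import socket

# Copied from the quoted config: user-agent substring -> allowed DNS domains.
CONFIG = {
    'Googlebot': ['googlebot.com'],
    'Mediapartners-Google': ['googlebot.com'],
    'msnbot': ['live.com', 'msn.com', 'bing.com'],
    'bingbot': ['live.com', 'msn.com', 'bing.com'],
    'Yahoo! Slurp': ['yahoo.com', 'yahoo.net'],
}


def _reverse(ip):
    """Return the PTR host name for ip, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def _forward(host):
    """Return the IPv4 addresses for host, or an empty list on failure."""
    try:
        return socket.gethostbyname_ex(host)[2]
    except OSError:
        return []


def is_welcome_crawler(user_agent, ip, config=CONFIG,
                       reverse=_reverse, forward=_forward):
    """Assumed check: True only if the claimed crawler passes reverse and
    forward DNS verification against the configured domains."""
    # Collect the domains of every configured crawler named in the user agent.
    domains = [d for name, ds in config.items() if name in user_agent
               for d in ds]
    if not domains:
        return False  # not claiming to be a welcome crawler

    host = reverse(ip)
    if host is None:
        return False
    host = host.rstrip('.').lower()

    # The PTR name must be the domain itself or a subdomain of it.
    if not any(host == d or host.endswith('.' + d) for d in domains):
        return False

    # Forward-confirm: the PTR name must resolve back to the same IP,
    # otherwise anyone controlling their own reverse zone could fake it.
    return ip in forward(host)
```

A request that fails this check is not necessarily hostile; it just gets no
exemption from whatever rate limiting or blocking is applied to everyone else.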

How does WebID deal with it, except that it allows more fine-grained ACLs per
person/agent instead of per DNS domain? WebID is a cool thing and maybe crawlers
will use it in the future, but Martin needs solutions right now.

> Google and friends are the real problem to come, its the inadvertent 
> SPARQL query that kicks off of a transitive crawl that's going to reek 
> havoc.

Are you talking about one agent crawling in an unfriendly way, or 10,000 agents
crawling in a friendly way but nevertheless constituting a DDoS?

I think agents that behave in an unfriendly way will not be used by anyone
other than their authors.

Regards,

Michael Brunnbauer

-- 
++  Michael Brunnbauer
++  netEstate GmbH
++  Geisenhausener Straße 11a
++  81379 München
++  Tel +49 89 32 19 77 80
++  Fax +49 89 32 19 77 89 
++  E-Mail brunni@netestate.de
++  http://www.netestate.de/
++
++  Sitz: München, HRB Nr.142452 (Handelsregister B München)
++  USt-IdNr. DE221033342
++  Geschäftsführer: Michael Brunnbauer, Franz Brunnbauer
++  Prokurist: Dipl. Kfm. (Univ.) Markus Hendel
Received on Thursday, 23 June 2011 11:13:47 UTC
