Re: Think before you write Semantic Web crawlers from Henry Story on 2011-06-23 (public-lod@w3.org from June 2011)

From: Henry Story <henry.story@bblfish.net>
Date: Thu, 23 Jun 2011 10:41:30 +0200
To: Michael Brunnbauer <brunni@netestate.de>
Cc: Martin Hepp <martin.hepp@ebusiness-unibw.org>, public-lod@w3.org
Message-Id: <233CF287-090E-487B-9CDD-3731519900FA@bblfish.net>

On 23 Jun 2011, at 10:20, Michael Brunnbauer wrote:

> 
> re
> 
> On Thu, Jun 23, 2011 at 10:09:25AM +0200, Martin Hepp wrote:
>> Yes, WebID is without question a good thing. I am not entirely sure, though, that you can make it a mandatory requirement for access to your site, because if a few major consumers do not use WebID for their crawlers, site-owners cannot block anonymous crawlers.
> 
> Google, Bing and Yahoo authenticate themselves via DNS: do a reverse lookup for
> the IP, check for some well-known domains, and then do a forward lookup of the
> hostname and check if it matches the IP. Much simpler to implement than WebID.
> 
> config = {
> 'Googlebot':['googlebot.com'],
> 'Mediapartners-Google':['googlebot.com'],
> 'msnbot':['live.com','msn.com','bing.com'],
> 'bingbot':['live.com','msn.com','bing.com'],
> 'Yahoo! Slurp':['yahoo.com','yahoo.net']
> }
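
For readers who want to see the check spelled out, here is a minimal sketch of the reverse-then-forward lookup described above, reusing the quoted config. The function names (`host_in_domains`, `verify_crawler`) and the injectable `reverse`/`forward` resolver parameters are assumptions made for illustration, not anyone's production code, and the forward lookup shown only covers IPv4.

```python
import socket

# Crawler name -> domains its hosts are expected to resolve under
# (copied from the config quoted above).
CONFIG = {
    'Googlebot': ['googlebot.com'],
    'Mediapartners-Google': ['googlebot.com'],
    'msnbot': ['live.com', 'msn.com', 'bing.com'],
    'bingbot': ['live.com', 'msn.com', 'bing.com'],
    'Yahoo! Slurp': ['yahoo.com', 'yahoo.net'],
}


def host_in_domains(hostname, domains):
    """True if hostname equals one of the domains or is a subdomain of it."""
    host = hostname.rstrip('.').lower()
    return any(host == d or host.endswith('.' + d) for d in domains)


def verify_crawler(ip, user_agent, config=CONFIG,
                   reverse=socket.gethostbyaddr,
                   forward=socket.gethostbyname_ex):
    """Return True if ip passes the reverse-then-forward DNS check."""
    # Collect the domains of every known crawler named in the User-Agent.
    domains = [d for name, ds in config.items()
               if name in user_agent for d in ds]
    if not domains:
        return False  # not a crawler we know how to verify
    try:
        hostname = reverse(ip)[0]           # reverse lookup (PTR record)
        if not host_in_domains(hostname, domains):
            return False
        addresses = forward(hostname)[2]    # forward lookup (IPv4 only)
    except OSError:
        return False                        # lookup failed: treat as unverified
    return ip in addresses
```

The suffix test requires a leading dot so that a host such as `evilgooglebot.com` does not pass as `googlebot.com`. Note that the sketch says nothing about what to do with a client that fails the check.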

That looks simple written down like that. But when things start scaling, this becomes a full-time job. At AltaVista there was a person dedicated to dealing with these types of rules (even when the company was under 50 people), to work out who was abusing the system, what types of throttles to apply, etc. At the time the big issue was that a huge portion of the web traffic came from a few AOL IP addresses. If someone misbehaved there, would you throttle all of AOL? The same is certainly true now, in much larger numbers. Are you going to throttle a whole IP block because of the bad behaviour of one individual? And as you can see, the above is still reasonable when the bots are limited to a small number of massive crawlers. But when everyone can crawl, working out who is who through IP addresses will not be possible.

Sure, WebID is not implemented widely yet. But the Semantic Web has the most to gain by its adoption, since it ties right into linked data - it was originally called foaf+ssl! WebID is not that difficult to implement, and since these datasets are being placed online to test the skills and quality of the engineers, why not put a few datasets online protected in different ways with WebIDs? This will help build knowledge of how:

 - to protect web services with WebID
 - to build clients that use WebID
 - to get feedback on how crawlers are behaving

(If you are worried about anonymity, you could have your crawler use a WebID that cannot be traced to an institution, and later when you have collected the data prove that you are in control of that WebID.)

So for crawler writers, giving their crawler a WebID is half a day's work to get going. We have written WebID implementations for servers in a day or two. Of course, in both cases one can always tune and tune and tune. But you can get going really quickly.
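
To give a rough idea of why the server side is small, here is a sketch of the core WebID check, under stated assumptions. It assumes the TLS handshake has already proven that the client holds the private key for its certificate, and that the claimed WebID URIs have been read from the certificate's Subject Alternative Name. The `fetch_profile_keys` callable is hypothetical: it stands in for dereferencing the URI and extracting the RSA public keys listed in the profile document, which is where most of the real work lives.

```python
from typing import Callable, Iterable, List, Set, Tuple

# An RSA public key represented as (modulus, exponent).
RsaKey = Tuple[int, int]


def verify_webid(san_uris: Iterable[str], cert_key: RsaKey,
                 fetch_profile_keys: Callable[[str], Set[RsaKey]]) -> List[str]:
    """Return the WebID URIs from the certificate that the server can confirm.

    A URI is confirmed when the profile document it points to lists the
    same public key as the client certificate.
    """
    verified = []
    for uri in san_uris:
        try:
            profile_keys = fetch_profile_keys(uri)  # hypothetical helper
        except Exception:
            continue  # unreachable or unparsable profile: claim not verified
        if cert_key in profile_keys:
            verified.append(uri)
    return verified
```

An empty result means the request is effectively anonymous. A non-empty result gives the server a stable identity to throttle or block, rather than an IP address.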

Henry

Social Web Architect
http://bblfish.net/

Received on Thursday, 23 June 2011 08:42:01 UTC