
Re: Think before you write Semantic Web crawlers

From: Kingsley Idehen <kidehen@openlinksw.com>
Date: Thu, 23 Jun 2011 08:16:30 +0100
Message-ID: <4E02E84E.9000009@openlinksw.com>
To: public-lod@w3.org
On 6/22/11 11:26 PM, Henry Story wrote:
> On 23 Jun 2011, at 00:11, Alexandre Passant wrote:
>> On 22 Jun 2011, at 22:49, Richard Cyganiak wrote:
>>> On 21 Jun 2011, at 10:44, Martin Hepp wrote:
>>>> PS: I will not release the IP ranges from which the trouble originated, but rest assured, there were top research institutions among them.
>>> The right answer is: name and shame. That is the way to teach them.
>> You may have found the right word: teach.
>> We (as academics) have given tutorials on how to publish and consume LOD, with lots about best practices for publishing but not much about consuming.
>> Why not simply come up with reasonable guidelines for this, which should also be taught in institutes / universities where people use LOD, and in tutorials given at various conferences?
> That is of course a good idea. But longer term you don't want to teach that way. It's too time consuming. You need the machines to do the teaching.
> Think about Facebook. How did 500 million people come to use it? Because they were introduced by friends and learned by using it, not by doing tutorials and going to courses. The system itself teaches people how to use it.
> In the same way, if you want to teach people linked data, get the social web going and they will learn the rest by themselves. If you want to teach crawlers to behave, make bad behaviour uninteresting. Create a game and rules where good behaviour is rewarded and bad behaviour has the opposite effect.
> This is why I think using WebID can help. You can use the information to build lists and rankings of good and bad crawlers: people with good crawlers get to present papers at crawling confs, bad crawlers get throttled out of crawling. Make it so that the system can grow beyond academic and teaching settings, into the world of billions of users spread across the world, living in different political institutions and speaking different languages. We have had good crawling practices since the beginning of the web, but you need to make them evident and self-teaching.
> E.g. a crawler that crawls too much will get slowed down and redirected to pages on crawling behavior, written and translated into every single language on the planet.


That's the game in a nutshell!

We have to keep virtuous cycles at the core of the increasingly social Web.
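
A minimal sketch of the "slow down, then redirect" rule Henry describes above, in case it helps make the game concrete. Everything named here is an assumption for illustration: the thread gives no thresholds, so `WINDOW_SECONDS`, `SOFT_LIMIT`, `HARD_LIMIT` and `GUIDELINES_URL` are made-up values, `CrawlerThrottle` is a hypothetical class, and `client_id` could be a WebID URI or an IP address depending on what the server can see.

```python
from collections import defaultdict, deque

# Hypothetical values; the thread does not specify thresholds or a URL.
WINDOW_SECONDS = 60.0
SOFT_LIMIT = 5       # requests per window before the crawler is slowed down
HARD_LIMIT = 10      # requests per window before the crawler is redirected
GUIDELINES_URL = "http://example.org/crawling-guidelines"


class CrawlerThrottle:
    """Sliding-window request counter keyed by a client identifier."""

    def __init__(self):
        # client id -> timestamps (in seconds) of recent requests
        self._hits = defaultdict(deque)

    def decide(self, client_id, now):
        """Record one request at time `now` and return (action, detail)."""
        hits = self._hits[client_id]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= WINDOW_SECONDS:
            hits.popleft()
        hits.append(now)
        count = len(hits)
        if count > HARD_LIMIT:
            # Too greedy: send the crawler to the guidelines page.
            return ("redirect", GUIDELINES_URL)
        if count > SOFT_LIMIT:
            # Delay (in seconds) grows with each request over the soft limit.
            return ("delay", float(count - SOFT_LIMIT))
        return ("ok", None)
```

A server would call `decide()` once per incoming request and then serve the response, sleep for the returned delay, or issue a redirect. Well-behaved crawlers never notice the rule; greedy ones are slowed first and pointed at the guidelines only if they persist.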

> Henry
>> m2c
>> Alex.
>>> Like Karl said, we should collect information about abusive crawlers so that site operators can defend themselves. It won't be *that* hard to research and collect the IP ranges of offending universities.
>>> I started a list here:
>>> http://www.w3.org/wiki/Bad_Crawlers
>>> The list is currently empty. I hope it stays that way.
>>> Thank you all,
>>> Richard
>> --
>> Dr. Alexandre Passant,
>> Social Software Unit Leader
>> Digital Enterprise Research Institute,
>> National University of Ireland, Galway
> Social Web Architect
> http://bblfish.net/



Kingsley Idehen
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen
Received on Thursday, 23 June 2011 07:16:53 UTC
