
Re: Think before you write Semantic Web crawlers

From: Henry Story <henry.story@bblfish.net>
Date: Thu, 23 Jun 2011 00:26:06 +0200
Cc: Richard Cyganiak <richard@cyganiak.de>, Martin Hepp <martin.hepp@ebusiness-unibw.org>, public-lod@w3.org
Message-Id: <24B905FC-D454-4C8D-8415-F41A87DAE343@bblfish.net>
To: Alexandre Passant <alexandre.passant@deri.org>

On 23 Jun 2011, at 00:11, Alexandre Passant wrote:

> On 22 Jun 2011, at 22:49, Richard Cyganiak wrote:
>> On 21 Jun 2011, at 10:44, Martin Hepp wrote:
>>> PS: I will not release the IP ranges from which the trouble originated, but rest assured, there were top research institutions among them.
>> The right answer is: name and shame. That is the way to teach them.
> You may have found the right word: teach.
> We (as academics) have given tutorials on how to publish and consume LOD: plenty about best practices for publishing, but not much about consuming.
> Why not simply come up with reasonable guidelines for this? They should also be taught in institutes and universities where people use LOD, and in tutorials given at various conferences.

That is of course a good idea. But in the longer term you don't want to teach that way; it's too time-consuming. You need the machines to do the teaching.

Think about Facebook. How did 500 million people come to use it? Because they were introduced by friends and learned by using it, not by doing tutorials or taking courses. The system itself teaches people how to use it.

In the same way, if you want to teach people linked data, get the social web going and they will learn the rest by themselves. If you want to teach crawlers to behave, make bad behaviour uninteresting. Create a game with rules where good behaviour is rewarded and bad behaviour has the opposite effect.

This is why I think using WebID can help. You can use the information to build lists and rankings of good and bad crawlers: people with good crawlers get to present papers at crawling conferences, while bad crawlers get throttled out of crawling. Make it so that the system can grow beyond academic and teaching settings, into a world of billions of users spread across the globe, living under different political institutions and speaking different languages. We have had good crawling practices since the beginning of the web, but you need to make them evident and self-teaching.

E.g. a crawler that crawls too much will get slowed down and redirected to pages on crawling behaviour, written and translated into every single language on the planet.
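The throttle-and-educate idea above could be sketched, hypothetically, as a per-client token bucket on the server side: well-behaved crawlers are served normally, while over-eager ones receive an HTTP 429 with a Retry-After header and a link to a guidelines page. Everything here (the `GUIDELINES_URL`, the rate numbers, the `handle_request` helper) is an illustrative assumption, not anything specified on this list:

```python
import time

# Hypothetical "throttle and educate" sketch: one token bucket per client,
# keyed by WebID URI (or IP when no WebID is presented).
GUIDELINES_URL = "https://example.org/crawling-guidelines"  # assumed URL

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        # Refill proportionally to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = {}

def handle_request(client_id):
    """Return (status, headers) for one crawl request from client_id."""
    bucket = buckets.setdefault(client_id, TokenBucket(rate=1.0, capacity=5))
    if bucket.allow():
        return 200, {}
    # Slow the crawler down and point it at the behaviour guidelines.
    return 429, {"Retry-After": "60",
                 "Link": f'<{GUIDELINES_URL}>; rel="help"'}
```

With a burst capacity of 5 and a refill rate of 1 request per second, a crawler firing requests as fast as it can gets five 200s and then 429s until it backs off, which is exactly the self-teaching feedback loop described above.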


> m2c
> Alex.
>> Like Karl said, we should collect information about abusive crawlers so that site operators can defend themselves. It won't be *that* hard to research and collect the IP ranges of offending universities.
>> I started a list here:
>> http://www.w3.org/wiki/Bad_Crawlers
>> The list is currently empty. I hope it stays that way.
>> Thank you all,
>> Richard
> --
> Dr. Alexandre Passant, 
> Social Software Unit Leader
> Digital Enterprise Research Institute, 
> National University of Ireland, Galway

Social Web Architect
Received on Wednesday, 22 June 2011 22:26:38 UTC
