Re: Think before you write Semantic Web crawlers

On 21 Jun 2011, at 10:48, Christopher Gutteridge wrote:

> Would some kind of caching crawler mitigate this issue? Have someone write a well behaved crawler which allowed you to download a recent .ttl.tgz of various sites. Of course, that assumes the student is able to find such a cache.
> 
> Asking people nicely will only work in a very small community.

Well, there again, with WebID you could do it very nicely.

If the agent requesting a page (call him :spider) states in his WebID profile something like

   :spider a web:Crawler .

then you could just redirect him straight to the N3 dump of the graphs of your site. (I say graphs because your site is not necessarily consistent, so the crawler may want to keep track of which pages said what.)
A redirect may be a bit harsh, so you could at first just link him to the dump:

 <> crawlerDump </crawler-heaven>

If he misbehaved, you could force the redirect.
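
To make the decision above concrete, here is a minimal sketch in Python of how a server might pick between the three responses. It is only an illustration under several assumptions: the WebID profile has already been fetched over https and parsed into a set of (subject, predicate, object) triples, the prefix web: is never expanded in this thread so the class IRI below is a placeholder, and the names choose_action and Action are made up for the example.

```python
from enum import Enum

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Placeholder IRI (assumption): the thread only writes "web:Crawler"
# and never says what the "web:" prefix expands to.
WEB_CRAWLER = "http://example.org/web#Crawler"
# Dump location as written in the message above.
DUMP_PATH = "/crawler-heaven"


class Action(Enum):
    """The three responses discussed in the message (names are hypothetical)."""
    SERVE_PAGE = "serve the requested page as usual"
    LINK_TO_DUMP = "serve the page and add a link to the dump"
    REDIRECT_TO_DUMP = "redirect the agent to the dump"


def is_crawler(webid, profile_triples):
    """Return True if the profile states that webid is a web:Crawler."""
    return (webid, RDF_TYPE, WEB_CRAWLER) in profile_triples


def choose_action(webid, profile_triples, misbehaved=False):
    """Pick a response for an agent already authenticated via WebID.

    profile_triples: set of (subject, predicate, object) string tuples,
    assumed to be parsed from the agent's WebID profile beforehand.
    misbehaved: assumed to come from the site's own request logs.
    """
    if not is_crawler(webid, profile_triples):
        return Action.SERVE_PAGE
    if misbehaved:
        # Force the crawler onto the dump.
        return Action.REDIRECT_TO_DUMP
    # Be gentle first: just point the crawler at the dump.
    return Action.LINK_TO_DUMP


if __name__ == "__main__":
    spider = "https://crawler.example/profile#spider"
    profile = {(spider, RDF_TYPE, WEB_CRAWLER)}
    print(choose_action(spider, profile).value)
    print(choose_action(spider, profile, misbehaved=True).value)
```

How "misbehaved" gets decided (request rate, ignoring robots.txt, and so on) is left open here, as it is in the message.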

Henry


> 
> Henry Story wrote:
>> 
>> A solution to stupid crawlers would be to put the linked data behind https endpoints, and use WebID 
>> for authentication. You could still allow everyone access, but at least you would force the crawler to identify 
>> himself, and use these WebIDs to learn who was making the crawler. This could then be used as a piece of the evaluation of the quality of a semantic web stack.
>> 
>> Henry
>> 
>> 10 minute intro to WebID http://bblfish.net/blog/2011/05/25/ (in browsers, but the browser is not really necessary)
>> 
>> On 21 Jun 2011, at 09:49, Martin Hepp wrote:
>> 
>>   
>>> Hi all:
>>> 
>>> For the third time in a few weeks, we had massive complaints from site-owners that Semantic Web crawlers from Universities visited their sites in a way close to a denial-of-service attack, i.e., crawling data with maximum bandwidth in a parallelized approach.
>>> 
>>> It's clear that a single, stupidly written crawler script, run from a powerful University network, can quickly create terrible traffic load. 
>>> 
>>> Many of the scripts we saw
>>> 
>>> - ignored robots.txt,
>>> - ignored clear crawling speed limitations in robots.txt,
>>> - did not identify themselves properly in the HTTP request header or lacked contact information therein, 
>>> - used no mechanisms at all for limiting the default crawling speed and re-crawling delays.
>>> 
>>> This irresponsible behavior can be the final reason for site-owners to say farewell to academic/W3C-sponsored semantic technology.
>>> 
>>> So please, please - advise all of your colleagues and students NOT to write simple crawler scripts for the billion triples challenge or the like without familiarizing themselves with the state of the art in "friendly crawling".
>>> 
>>> Best wishes
>>> 
>>> Martin Hepp
>>> 
>>>     
>> 
>> Social Web Architect
>> http://bblfish.net/
>> 
>> 
>>   
> 
> -- 
> Christopher Gutteridge -- http://id.ecs.soton.ac.uk/person/1248
> 
> You should read the ECS Web Team blog: http://blogs.ecs.soton.ac.uk/webteam/

Social Web Architect
http://bblfish.net/

Received on Tuesday, 21 June 2011 09:55:03 UTC