Re: Please stop massive crawling against http://openean.kaufkauf.net/id/

From: Kingsley Idehen <kidehen@openlinksw.com>
Date: Tue, 08 Jun 2010 09:38:28 -0400
Message-ID: <4C0E47D4.3070202@openlinksw.com>
To: Robert Fuller <robert.fuller@deri.org>
CC: public-lod@w3.org

Robert Fuller wrote:
> Hi,
> Sindice clearly identifies itself in the user agent http header. 
> Currently we use these user agents:
> 1. "Mozilla/5.0 (compatible; sindice-fetcher/0.1.0 
> +http://sindice.com/developers/bot)"
> 2. "SindiceFetcher/Ping Manager (http://sindice.com/developers/bot"
> 3. "sindice.net ontology fetcher"
> Niceness is implemented in our main fetcher. In some cases there may 
> be bursts on sites providing distributed ontologies. Speaking with the 
> group here, it seems unlikely that we have been hitting 
> kaufkauf.net; however, if you can provide an IP address I can do some 
> further verification.
> I understand that http://lod.openlinksw.com/sparql is now hosted at 
> DERI, and I wonder could some of the traffic be related to that? 
> Again, if you can provide an IP address I will do some further 
> verification.


As indicated by Martin, the <http://lod.openlinksw.com> instance hosted 
at DERI should negate the need to go back to the original source.


The LOD Cloud Cache at DERI is a live Virtuoso instance with 15 Billion+ 
Triples loaded. It covers as much of the LOD Cloud as we've been able to 
get our hands on plus 6.4 Billion Triples from the Data.Gov effort.

I'll drop a more detailed note about this instance (via blog post) once 
we are done with data loading (there's a massive collection of eCommerce 
oriented Products & Services data to be loaded amongst others).

> Kind regards,
> Rob.
> -- 
> Robert Fuller
> Research Associate
> DERI, Galway



Kingsley Idehen
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen 
Received on Tuesday, 8 June 2010 13:39:21 UTC
