Re: Think before you write Semantic Web crawlers

From: Martin Hepp <martin.hepp@ebusiness-unibw.org>
Date: Tue, 21 Jun 2011 12:41:42 +0200
Cc: public-lod@w3.org
Message-Id: <CD209282-F2BA-420B-9A3D-395BAE527758@ebusiness-unibw.org>
To: Daniel Herzig <herzig@kit.edu>
Thanks for the hint, but I am not talking about "my" servers.

I am talking about a site-owner somewhere in Kentucky running a small shop on www.godaddy.com who adds RDF to his site, informs PingTheSemanticWeb, and what he gets in turn are wild-west crawlers that bring down his tiny server by crawling the same data over and over again from a powerful University network.
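To make the crawler side of this concrete, here is a minimal sketch of the kind of bookkeeping that would avoid the behavior described above: it skips URLs that were fetched recently and spaces out requests to the same host. The class name `PoliteScheduler`, its parameters, and the default values are assumptions chosen for illustration, not taken from any existing crawler.

```python
import time
from urllib.parse import urlparse


class PoliteScheduler:
    """Hypothetical helper: decides whether and when a URL may be fetched."""

    def __init__(self, min_delay=10.0, refetch_after=86400.0, clock=time.monotonic):
        self.min_delay = min_delay          # assumed: seconds between hits on one host
        self.refetch_after = refetch_after  # assumed: seconds before a URL is fetched again
        self.clock = clock                  # injectable so the logic can be tested
        self._last_hit = {}                 # host -> time of the last request
        self._fetched_at = {}               # url -> time of the last fetch

    def wait_time(self, url):
        """Return seconds to wait before fetching url, or None to skip it."""
        now = self.clock()
        fetched = self._fetched_at.get(url)
        if fetched is not None and now - fetched < self.refetch_after:
            return None  # fetched recently: do not crawl the same data again
        last = self._last_hit.get(urlparse(url).netloc)
        if last is None:
            return 0.0   # first request to this host
        return max(0.0, self.min_delay - (now - last))

    def record_fetch(self, url):
        """Remember that url was fetched just now."""
        now = self.clock()
        self._last_hit[urlparse(url).netloc] = now
        self._fetched_at[url] = now
```

A crawler loop would call `wait_time(url)` before each request, sleep for the returned number of seconds (or drop the URL when it returns `None`), and call `record_fetch(url)` afterwards. A real crawler should also honor robots.txt and HTTP caching headers, which this sketch leaves out.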

These may be small nuisances from a historic perspective, but they may be the last trigger for the end of the academic/W3C part of the Semantic Web project.

As a side-note: Even if Google, Bing, and Yahoo say - "yes, folks - you can use RDFa and all of the fancy academic SW stuff if you absolutely want, and yes, we promise to not punish your sites" - which market share for RDFa and the traditional Web do you expect over Microdata and "their" way of adding structured data?

I bet a bottle of Champagne that, a year from now, the market share of the academic Semantic Web movement will be less than 10 %, even if such a statement were made loudly and clearly - not 10 % of the total Web, but 10 % of all structured Web content.

If you do not reach the SEO and Web hacker worlds, your project is dead, because even the greatest technological advantage that you may bring will be nothing against a 90 % dominance as far as the consumption of the data is concerned. And if you annoy site-owners, you are out of the game even quicker.


On Jun 21, 2011, at 12:04 PM, Daniel Herzig wrote:

> Hi Martin,
> Have you tried to put a Squid [1] as a reverse proxy in front of your servers and use delay pools [2] to catch hungry crawlers?
> Maybe that helps.
> Cheers,
> Daniel
> [1] http://www.squid-cache.org/
> [2] http://wiki.squid-cache.org/Features/DelayPools
Received on Tuesday, 21 June 2011 10:42:18 UTC
