Re: Think before you write Semantic Web crawlers from Andreas Harth on 2011-06-22 (semantic-web@w3.org from June 2011)

From: Andreas Harth <andreas@harth.org>
Date: Wed, 22 Jun 2011 14:37:02 +0200
To: Martin Hepp <martin.hepp@ebusiness-unibw.org>
CC: Yves Raimond <yves.raimond@gmail.com>, Christopher Gutteridge <cjg@ecs.soton.ac.uk>, Daniel Herzig <herzig@kit.edu>, semantic-web@w3.org, public-lod@w3.org
Message-ID: <4E01E1EE.2080105@harth.org>

Hi Martin,

First let me say that I do think crawlers should follow basic politeness
rules (contact info in User-Agent, adhere to the Robots Exclusion Protocol).

However, I am delighted that people are actually starting to consume Linked
Data, and we should encourage that.

On 06/22/2011 11:42 AM, Martin Hepp wrote:
> OpenEAN - a transcript of >1 Mio product models and their EAN/UPC code at
> http://openean.kaufkauf.net/id/ has been permanently shut down by the site
> operator because fighting with bad semweb crawlers is taking too much of his
> time.

I've put a wrapper online [1] that provides RDF based on their API (which,
incidentally, currently does not seem to work either).

The wrapper does some caching and has a limit of one lookup every 8 seconds,
which means (24*60*60)/8 = 10800 lookups per day.  Data transfer is capped
to 1 GB/day, which means a maximum cost of 0.15 Euro/day at Amazon AWS pricing.

At that rate, it would take about 93 days (1,000,000 / 10800 = 92.6) to
collect descriptions of just one million products.  Whether the ratio of data
size to lookup limit is sensible in that case is open to debate.
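
For anyone who wants to redo the arithmetic with a different limit, here is
a minimal sketch in Python.  The function names are made up for illustration
and are not part of the wrapper; the only inputs taken from above are the
8-second interval and the one million products.

```python
# Back-of-the-envelope check of the rate-limit arithmetic.
# Function names are hypothetical, chosen only for this illustration.

SECONDS_PER_DAY = 24 * 60 * 60


def lookups_per_day(seconds_per_lookup: float) -> float:
    """Number of lookups allowed per day at one lookup per interval."""
    return SECONDS_PER_DAY / seconds_per_lookup


def days_to_collect(items: int, seconds_per_lookup: float) -> float:
    """Days needed to fetch `items` descriptions at the given rate."""
    return items / lookups_per_day(seconds_per_lookup)


if __name__ == "__main__":
    print(lookups_per_day(8))
    print(round(days_to_collect(1_000_000, 8), 1))
```

Changing the second argument shows how quickly the collection time shrinks
or grows with the lookup interval.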

If the OpenEAN guys can redirect requests to [1] there would even be some
continuity for data consumers.

Best regards,
Andreas.

[1] http://openeanwrap.appspot.com/

Received on Wednesday, 22 June 2011 12:37:44 UTC