Re: Think before you write Semantic Web crawlers

Just to inform the community that the BTC / research crawlers have been successful in killing a major RDF source for e-commerce:

OpenEAN - a dataset of more than 1 million product models and their EAN/UPC codes at http://openean.kaufkauf.net/id/ - has been permanently shut down by the site operator, because fighting off badly behaved semweb crawlers was taking too much of his time.

Thanks a lot to everybody who contributed to that. It trashes a month of work and many millions of useful triples.

Best

Martin Hepp



On Jun 22, 2011, at 11:37 AM, Yves Raimond wrote:

> Hello!
> 
>> The difference between these two scenarios is that there's almost no CPU
>> involvement in serving the PDF file, but naive RDF sites use lots of cycles
>> to generate the response to a query for an RDF document.
>> 
>> Right now queries to data.southampton.ac.uk (e.g.
>> http://data.southampton.ac.uk/products-and-services/CupCake.rdf ) are answered
>> live, but this is not efficient. My colleague, Dave Challis, has prepared a
>> SPARQL endpoint which caches results, which we can turn on if the load gets
>> too high; that should at least mitigate the problem. Very few datasets
>> change in a 24-hour period.
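>> 
>> (As an aside, a minimal sketch of that kind of result caching - illustrative
>> only; the endpoint URL, the 24-hour TTL and the SPARQLWrapper dependency are
>> assumptions, not the actual data.southampton.ac.uk setup:)
>> 
>> import hashlib
>> import time
>> from SPARQLWrapper import SPARQLWrapper, JSON  # assumed dependency
>> 
>> ENDPOINT = "http://example.org/sparql"  # hypothetical endpoint
>> TTL = 24 * 3600                         # most datasets change less than daily
>> _cache = {}                             # query hash -> (timestamp, result)
>> 
>> def cached_query(query):
>>     key = hashlib.sha1(query.encode("utf-8")).hexdigest()
>>     hit = _cache.get(key)
>>     if hit and time.time() - hit[0] < TTL:
>>         return hit[1]                   # cached: no query evaluation at all
>>     sparql = SPARQLWrapper(ENDPOINT)
>>     sparql.setQuery(query)
>>     sparql.setReturnFormat(JSON)
>>     result = sparql.query().convert()
>>     _cache[key] = (time.time(), result)
>>     return result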
> 
> Hmm, I would strongly argue that this is not the case (and stale datasets
> are a big issue in LOD, imho!). The data on the BBC website, for example,
> changes approximately 10 times a second.
> 
> We've also been hit in the past (and still are now, to a lesser extent) by
> badly behaving crawlers. I agree that, as we don't provide dumps, crawling
> is the only way to generate an aggregation of BBC data, but we've had
> downtime in the past caused by crawlers. When that happened, it
> triggered lots of discussion about whether we should publish RDF data at
> all (thankfully, we managed to argue that we should keep it - but
> that's a lot of time spent arguing instead of publishing new juicy RDF
> data!)
> 
> I also want to point out (in response to Andreas's email) that HTTP
> caches are *completely* ineffective at protecting a dataset against that,
> as crawlers tend to be exhaustive. ETags and Expires headers are
> helpful, but chances are that 1) you don't know when the data will
> change, you can just make a wild guess based on previous behavior, and 2)
> the cache will have expired by the time the crawler requests a document
> a second time, as it has ~100M documents (in our case) to crawl
> through.
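> 
> (For illustration, a minimal sketch of a crawler that at least honours ETags
> via conditional requests - the requests dependency, the agent string and all
> names here are assumptions, not anything we actually run:)
> 
> import requests  # assumed third-party dependency
> 
> etags = {}  # URL -> last ETag seen
> 
> def fetch(url):
>     headers = {"User-Agent": "example-crawler/0.1 (mailto:ops@example.org)"}
>     if url in etags:
>         headers["If-None-Match"] = etags[url]
>     resp = requests.get(url, headers=headers, timeout=30)
>     if resp.status_code == 304:
>         return None  # unchanged since the last crawl, nothing to re-parse
>     if "ETag" in resp.headers:
>         etags[url] = resp.headers["ETag"]
>     return resp.content
> 
> Even that only saves the server work on the second pass over a document,
> which an exhaustive crawl of ~100M documents rarely gets to in time.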
> 
> Request throttling would work, but you would have to find a way to
> identify crawlers, which is tricky: most of them use multiple IPs and
> don't set appropriate user agents (the crawlers that currently hit us
> the most are wget and Java 1.6 :/ ).
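> 
> (For what it's worth, a rough sketch of the kind of per-IP throttling meant
> here - stdlib only, the one-second minimum interval is an assumed policy, and
> a real deployment would enforce this at the proxy rather than in application
> code:)
> 
> import time
> from collections import defaultdict
> 
> MIN_INTERVAL = 1.0              # assumed: at most one request per second per IP
> last_seen = defaultdict(float)  # client IP -> time of last accepted request
> 
> def allow(client_ip):
>     now = time.time()
>     if now - last_seen[client_ip] < MIN_INTERVAL:
>         return False            # reject, e.g. with 503 and a Retry-After header
>     last_seen[client_ip] = now
>     return True
> 
> Of course, a crawler spread over many IPs slips straight through this, which
> is exactly the identification problem above.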
> 
> So overall, there is no excuse for badly behaving crawlers!
> 
> Cheers,
> y
> 
>> 
>> Martin Hepp wrote:
>> 
>> Hi Daniel,
>> Thanks for the link! I will relay this to relevant site-owners.
>> 
>> However, I still challenge Andreas' statement that the site-owners are to
>> blame for publishing large amounts of data on small servers.
>> 
>> One can publish 10,000 PDF documents on a tiny server without being hit by
>> DoS-style crazy crawlers. Why should the same not hold if I publish RDF?
>> 
>> But for sure, it is necessary to advise all publishers of large RDF datasets
>> to protect themselves against hungry crawlers and actual DoS attacks.
>> 
>> Imagine if a large site were brought down by a botnet exploiting
>> Semantic Sitemap information for DoS attacks, focussing on the large dump
>> files. This could end LOD experiments for that site.
>> 
>> 
>> Best
>> 
>> Martin
>> 
>> 
>> On Jun 21, 2011, at 10:24 AM, Daniel Herzig wrote:
>> 
>> Hi Martin,
>> 
>> Have you tried to put a Squid [1]  as reverse proxy in front of your servers
>> and use delay pools [2] to catch hungry crawlers?
>> 
>> Cheers,
>> Daniel
>> 
>> [1] http://www.squid-cache.org/
>> [2] http://wiki.squid-cache.org/Features/DelayPools
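>> 
>> (A minimal squid.conf sketch of such a delay pool - the numbers are
>> assumptions to tune to your bandwidth, and Squid needs to be built with
>> delay-pool support:)
>> 
>> # one class-2 pool: unlimited aggregate, one bucket per client IP
>> delay_pools 1
>> delay_class 1 2
>> delay_access 1 allow all
>> # aggregate unlimited; each client IP refills at 64 KB/s with a 256 KB burst
>> delay_parameters 1 -1/-1 64000/256000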
>> 
>> On 21.06.2011, at 09:49, Martin Hepp wrote:
>> 
>> Hi all:
>> 
>> For the third time in a few weeks, we have had massive complaints from
>> site-owners that Semantic Web crawlers from universities visited their sites
>> in a way close to a denial-of-service attack, i.e., crawling data at
>> maximum bandwidth in a parallelized approach.
>> 
>> It's clear that a single, stupidly written crawler script, run from a
>> powerful university network, can quickly create a terrible traffic load.
>> 
>> Many of the scripts we saw
>> 
>> - ignored robots.txt,
>> - ignored clear crawling speed limitations in robots.txt,
>> - did not identify themselves properly in the HTTP request header or lacked
>> contact information therein,
>> - used no mechanisms at all for limiting the default crawling speed and
>> re-crawling delays.
>> 
>> This irresponsible behavior can be the final reason for site-owners to say
>> farewell to academic/W3C-sponsored semantic technology.
>> 
>> So please, please - advise all of your colleagues and students NOT to write
>> simple crawler scripts for the Billion Triples Challenge or anything else
>> without familiarizing themselves with the state of the art in "friendly
>> crawling".
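>> 
>> (For reference, a minimal sketch of what "friendly crawling" looks like -
>> illustrative only; the agent string, the fallback delay and the handle()
>> placeholder are made up:)
>> 
>> import time
>> from urllib.request import Request, urlopen
>> from urllib.robotparser import RobotFileParser
>> 
>> AGENT = "example-research-crawler/0.1 (mailto:researcher@example.edu)"  # hypothetical
>> 
>> def crawl(base, paths, fallback_delay=5.0):
>>     rp = RobotFileParser(base + "/robots.txt")
>>     rp.read()
>>     delay = rp.crawl_delay(AGENT) or fallback_delay  # honour Crawl-delay (Python 3.6+)
>>     for path in paths:
>>         url = base + path
>>         if not rp.can_fetch(AGENT, url):
>>             continue                      # disallowed by robots.txt
>>         req = Request(url, headers={"User-Agent": AGENT})
>>         with urlopen(req, timeout=30) as resp:
>>             handle(resp.read())
>>         time.sleep(delay)                 # sequential, rate-limited fetching
>> 
>> def handle(body):
>>     pass  # placeholder for whatever the crawler does with a document
>> 
>> One process, one connection at a time, a fixed delay - nothing clever, and it
>> already avoids every problem on the list above.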
>> 
>> Best wishes
>> 
>> Martin Hepp
>> 
>> --
>> Christopher Gutteridge -- http://id.ecs.soton.ac.uk/person/1248
>> 
>> You should read the ECS Web Team blog: http://blogs.ecs.soton.ac.uk/webteam/
>> 

Received on Wednesday, 22 June 2011 09:43:36 UTC