- From: Martin Hepp <martin.hepp@ebusiness-unibw.org>
- Date: Wed, 22 Jun 2011 11:42:58 +0200
- To: Yves Raimond <yves.raimond@gmail.com>
- Cc: Christopher Gutteridge <cjg@ecs.soton.ac.uk>, Daniel Herzig <herzig@kit.edu>, semantic-web@w3.org, public-lod@w3.org
Just to inform the community that the BTC / research crawlers have succeeded
in killing a major RDF source for e-commerce: OpenEAN, a dataset of more than
one million product models and their EAN/UPC codes at
http://openean.kaufkauf.net/id/, has been permanently shut down by the site
operator, because fighting off badly behaved semweb crawlers was taking too
much of his time. Thanks a lot to everybody who contributed to that. It
trashes a month of work and many millions of useful triples.

Best

Martin Hepp

On Jun 22, 2011, at 11:37 AM, Yves Raimond wrote:

> Hello!
>
>> The difference between these two scenarios is that there's almost no CPU
>> involvement in serving the PDF file, but naive RDF sites use lots of
>> cycles to generate the response to a query for an RDF document.
>>
>> Right now queries to data.southampton.ac.uk (e.g.
>> http://data.southampton.ac.uk/products-and-services/CupCake.rdf ) are
>> answered live, but this is not efficient. My colleague, Dave Challis, has
>> prepared a SPARQL endpoint that caches results, which we can turn on if
>> the load gets too high; that should at least mitigate the problem. Very
>> few datasets change in a 24-hour period.
>
> Hmm, I would strongly argue that this is not the case (and stale datasets
> are a big issue in LOD, imho!). The data on the BBC website, for example,
> changes approximately 10 times a second.
>
> We've also been hit in the past (and still are now, to a lesser extent) by
> badly behaving crawlers. I accept that, as we don't provide dumps, crawling
> is the only way to build an aggregation of BBC data, but we've had downtime
> in the past caused by crawlers. After that happened, it triggered lots of
> discussion about whether we should publish RDF data at all (thankfully, we
> managed to argue that we should keep it - but that's a lot of time spent
> arguing instead of publishing new juicy RDF data!)
>
> I also want to point out (in response to Andreas's email) that HTTP caches
> are *completely* ineffective at protecting a dataset against this, as
> crawlers tend to be exhaustive. ETag and Expires headers are helpful, but
> chances are that 1) you don't know when the data will change and can only
> make a wild guess based on previous behavior, and 2) the cached copy will
> have expired by the time the crawler requests a document a second time, as
> it has ~100M documents (in our case) to crawl through.
>
> Request throttling would work, but you would have to find a way to identify
> crawlers, which is tricky: most of them use multiple IPs and don't set
> appropriate user agents (the crawlers that currently hit us the most
> identify themselves as wget and Java 1.6 :/ ).
>
> So overall, there is no excuse for badly behaving crawlers!
>
> Cheers,
> y
>
>> Martin Hepp wrote:
>>
>> Hi Daniel,
>> Thanks for the link! I will relay this to the relevant site owners.
>>
>> However, I still challenge Andreas's statement that the site owners are
>> to blame for publishing large amounts of data on small servers.
>>
>> One can publish 10,000 PDF documents on a tiny server without being hit
>> by DoS-style crazy crawlers. Why should the same not hold if I publish
>> RDF?
>>
>> But for sure, it is necessary to advise all publishers of large RDF
>> datasets to protect themselves against hungry crawlers and actual DoS
>> attacks.
>>
>> Imagine if a large site were brought down by a botnet exploiting Semantic
>> Sitemap information for DoS attacks, focusing on the large dump files.
>> This could end LOD experiments for that site.
>>
>> Best
>>
>> Martin
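On the throttling point above: a minimal sketch of per-client request
throttling, written as Python WSGI middleware. The 60-second window, the
30-request limit, the 503 response and keying clients on User-Agent plus IP
are illustrative assumptions, not a tested configuration; a reverse proxy
such as the Squid setup Daniel suggests further down may be more robust in
practice.

  # Rough sketch only: per-client request throttling as WSGI middleware.
  # The window, limit and client key are assumptions for illustration.
  import time
  from collections import defaultdict, deque

  class ThrottleMiddleware:
      def __init__(self, app, limit=30, window=60):
          self.app = app
          self.limit = limit        # max requests per client per window
          self.window = window      # window length in seconds
          self.hits = defaultdict(deque)

      def __call__(self, environ, start_response):
          # Key on User-Agent plus IP; crude, since heavy crawlers often
          # rotate IPs and send generic agent strings (wget, Java/1.6, ...).
          key = (environ.get('HTTP_USER_AGENT', '-'),
                 environ.get('REMOTE_ADDR', '-'))
          now = time.time()
          recent = self.hits[key]
          while recent and now - recent[0] > self.window:
              recent.popleft()      # forget requests outside the window
          if len(recent) >= self.limit:
              start_response('503 Service Unavailable',
                             [('Retry-After', str(self.window)),
                              ('Content-Type', 'text/plain')])
              return [b'Too many requests - please slow down.\n']
          recent.append(now)
          return self.app(environ, start_response)

It would simply be wrapped around an existing WSGI application, e.g.
application = ThrottleMiddleware(application).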
>>
>> On Jun 21, 2011, at 10:24 AM, Daniel Herzig wrote:
>>
>> Hi Martin,
>>
>> Have you tried putting Squid [1] as a reverse proxy in front of your
>> servers and using delay pools [2] to catch hungry crawlers?
>>
>> Cheers,
>> Daniel
>>
>> [1] http://www.squid-cache.org/
>> [2] http://wiki.squid-cache.org/Features/DelayPools
>>
>> On 21.06.2011, at 09:49, Martin Hepp wrote:
>>
>> Hi all:
>>
>> For the third time in a few weeks, we have had massive complaints from
>> site owners that Semantic Web crawlers from universities visited their
>> sites in a way close to a denial-of-service attack, i.e., crawling data
>> with maximum bandwidth in a parallelized approach.
>>
>> It's clear that a single, stupidly written crawler script, run from a
>> powerful university network, can quickly create a terrible traffic load.
>>
>> Many of the scripts we saw
>>
>> - ignored robots.txt,
>> - ignored clear crawling-speed limitations in robots.txt,
>> - did not identify themselves properly in the HTTP request header or
>>   lacked contact information therein,
>> - used no mechanisms at all for limiting the default crawling speed and
>>   re-crawling delays.
>>
>> This irresponsible behavior can be the final reason for site owners to
>> say farewell to academic/W3C-sponsored semantic technology.
>>
>> So please, please - advise all of your colleagues and students NOT to
>> write simple crawler scripts for the Billion Triples Challenge or anything
>> else without familiarizing themselves with the state of the art in
>> "friendly crawling".
>>
>> Best wishes
>>
>> Martin Hepp
>>
>> --
>> Christopher Gutteridge -- http://id.ecs.soton.ac.uk/person/1248
>>
>> You should read the ECS Web Team blog: http://blogs.ecs.soton.ac.uk/webteam/
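And for crawler authors, a minimal sketch of what "friendly crawling" could
look like in Python: the crawler identifies itself with contact details,
respects robots.txt including any Crawl-delay, and never drops below a
default per-host delay. The agent string, contact address, delays and Accept
header are placeholders, not a recommendation.

  # A rough sketch of a "friendly" fetch loop: it sends a descriptive
  # User-Agent with contact details, honours robots.txt (including
  # Crawl-delay) and never drops below a default per-host delay.
  # The agent string, addresses and delay values are placeholders.
  import time
  import urllib.request
  import urllib.robotparser
  from urllib.parse import urljoin, urlparse

  USER_AGENT = "ExampleResearchBot/0.1 (+http://example.org/bot; mailto:ops@example.org)"
  DEFAULT_DELAY = 5.0   # seconds between requests to the same host

  def crawl(urls):
      robots = {}      # per-host robots.txt parser
      last_hit = {}    # per-host time of the previous request
      for url in urls:
          host = urlparse(url).netloc
          rp = robots.get(host)
          if rp is None:
              rp = urllib.robotparser.RobotFileParser()
              rp.set_url(urljoin(url, "/robots.txt"))
              try:
                  rp.read()
              except OSError:
                  pass   # robots.txt unreachable: parser denies by default
              robots[host] = rp
          if not rp.can_fetch(USER_AGENT, url):
              continue   # disallowed - skip, do not retry
          delay = rp.crawl_delay(USER_AGENT) or DEFAULT_DELAY
          wait = last_hit.get(host, 0) + delay - time.time()
          if wait > 0:
              time.sleep(wait)   # honour Crawl-delay / default politeness
          request = urllib.request.Request(url, headers={
              "User-Agent": USER_AGENT,
              "Accept": "application/rdf+xml, text/turtle;q=0.9",
          })
          with urllib.request.urlopen(request, timeout=30) as response:
              data = response.read()
          last_hit[host] = time.time()
          yield url, data

Parallelism, retries and error handling are deliberately left out; a single
worker per host with a delay of this kind is already far gentler than the
behavior described above.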
Received on Wednesday, 22 June 2011 09:43:35 UTC