- From: Kingsley Idehen <kidehen@openlinksw.com>
- Date: Wed, 22 Jun 2011 11:43:18 +0100
- To: public-lod@w3.org
On 6/22/11 10:42 AM, Martin Hepp wrote:
> Just to inform the community that the BTC / research crawlers have been successful in killing a major RDF source for e-commerce:
>
> OpenEAN - a transcript of >1 Mio product models and their EAN/UPC codes at http://openean.kaufkauf.net/id/ - has been permanently shut down by the site operator because fighting with bad semweb crawlers is taking too much of his time.
>
> Thanks a lot to everybody who contributed to that. It trashes a month of work and many million useful triples.

Martin,

Is there a dump anywhere? Can they at least continue to produce RDF dumps?

We have some of their data (from prior dump loads) in our lod cloud cache [1].

Links:

1. http://lod.openlinksw.com/describe/?url=http%3A%2F%2Fopenean.kaufkauf.net%2Fid%2F&urilookup=1

Kingsley

> Best
>
> Martin Hepp
>
>
> On Jun 22, 2011, at 11:37 AM, Yves Raimond wrote:
>
>> Hello!
>>
>>> The difference between these two scenarios is that there's almost no CPU
>>> involvement in serving the PDF file, but naive RDF sites use lots of cycles
>>> to generate the response to a query for an RDF document.
>>>
>>> Right now queries to data.southampton.ac.uk (e.g.
>>> http://data.southampton.ac.uk/products-and-services/CupCake.rdf ) are made
>>> live, but this is not efficient. My colleague, Dave Challis, has prepared a
>>> SPARQL endpoint which caches results, which we can turn on if the load gets
>>> too high; that should at least mitigate the problem. Very few datasets
>>> change in a 24-hour period.
>>
>> Hmm, I would strongly argue that this is not the case (and stale datasets are
>> a big issue in LOD imho!). The data on the BBC website, for example,
>> changes approximately 10 times a second.
>>
>> We've also been hit in the past (and still now, to a lesser extent) by
>> badly behaving crawlers. I agree that, as we don't provide dumps, crawling
>> is the only way to generate an aggregation of BBC data, but we've had
>> downtime in the past caused by crawlers. After that happened, it
>> caused lots of discussions on whether we should publish RDF data at
>> all (thankfully, we succeeded in arguing that we should keep it - but
>> that's a lot of time spent arguing instead of publishing new juicy RDF
>> data!).
>>
>> I also want to point out (in response to Andreas's email) that HTTP
>> caches are *completely* ineffective at protecting a dataset against this,
>> as crawlers tend to be exhaustive. ETags and Expires headers are
>> helpful, but chances are that 1) you don't know when the data will
>> change, so you can only make a wild guess based on previous behavior, and 2)
>> the cache will have expired by the time the crawler requests a document
>> a second time, as it has ~100M (in our case) documents to crawl
>> through.
>>
>> Request throttling would work, but you would have to find a way to
>> identify crawlers, which is tricky: most of them use multiple IPs and
>> don't set appropriate user agents (the crawlers that currently hit us
>> the most are wget and Java 1.6 :/ ).
>>
>> So overall, there is no excuse for badly behaving crawlers!
>>
>> Cheers,
>> y
>>
>>> Martin Hepp wrote:
>>>
>>> Hi Daniel,
>>>
>>> Thanks for the link! I will relay this to relevant site-owners.
>>>
>>> However, I still challenge Andreas' statement that the site-owners are to
>>> blame for publishing large amounts of data on small servers.
>>>
>>> One can publish 10,000 PDF documents on a tiny server without being hit by
>>> DoS-style crazy crawlers. Why should the same not hold if I publish RDF?
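As an illustration of the request throttling Yves mentions above (and of the advice to publishers just below), here is a minimal sketch of per-client rate limiting as a WSGI wrapper. The limits, names, and the 503/Retry-After response are assumptions for the sketch, not anything deployed by the sites in this thread; and, as Yves notes, keying on the remote IP is crude, since crawlers rotate addresses and rarely send a meaningful User-Agent.

import time
from collections import defaultdict, deque

# Hypothetical budget: roughly 2 requests/second per client on average.
WINDOW_SECONDS = 60
MAX_REQUESTS = 120

_history = defaultdict(deque)   # remote address -> timestamps of recent requests

def throttle(app):
    """Wrap a WSGI application and reject clients that exceed the budget."""
    def wrapper(environ, start_response):
        client = environ.get("REMOTE_ADDR", "unknown")
        now = time.time()
        hits = _history[client]
        while hits and now - hits[0] > WINDOW_SECONDS:
            hits.popleft()                      # forget requests outside the window
        if len(hits) >= MAX_REQUESTS:
            start_response("503 Service Unavailable",
                           [("Retry-After", str(WINDOW_SECONDS)),
                            ("Content-Type", "text/plain")])
            return [b"Too many requests; please slow down and honour robots.txt.\n"]
        hits.append(now)
        return app(environ, start_response)
    return wrapper

A reverse proxy in front of the server (such as the Squid delay pools suggested further down in the thread) achieves much the same effect without touching application code.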
>>>
>>> But for sure, it is necessary to advise all publishers of large RDF datasets
>>> to protect themselves against hungry crawlers and actual DoS attacks.
>>>
>>> Imagine if a large site was brought down by a botnet that is exploiting
>>> Semantic Sitemap information for DoS attacks, focussing on the large dump
>>> files. This could end LOD experiments for that site.
>>>
>>> Best
>>>
>>> Martin
>>>
>>> On Jun 21, 2011, at 10:24 AM, Daniel Herzig wrote:
>>>
>>> Hi Martin,
>>>
>>> Have you tried to put a Squid [1] as reverse proxy in front of your servers
>>> and use delay pools [2] to catch hungry crawlers?
>>>
>>> Cheers,
>>> Daniel
>>>
>>> [1] http://www.squid-cache.org/
>>> [2] http://wiki.squid-cache.org/Features/DelayPools
>>>
>>> On 21.06.2011, at 09:49, Martin Hepp wrote:
>>>
>>> Hi all:
>>>
>>> For the third time in a few weeks, we had massive complaints from
>>> site-owners that Semantic Web crawlers from Universities visited their sites
>>> in a way close to a denial-of-service attack, i.e., crawling data with
>>> maximum bandwidth in a parallelized approach.
>>>
>>> It's clear that a single, stupidly written crawler script, run from a
>>> powerful University network, can quickly create terrible traffic load.
>>>
>>> Many of the scripts we saw
>>>
>>> - ignored robots.txt,
>>> - ignored clear crawling speed limitations in robots.txt,
>>> - did not identify themselves properly in the HTTP request header or lacked
>>>   contact information therein,
>>> - used no mechanisms at all for limiting the default crawling speed and
>>>   re-crawling delays.
>>>
>>> This irresponsible behavior can be the final reason for site-owners to say
>>> farewell to academic/W3C-sponsored semantic technology.
>>>
>>> So please, please - advise all of your colleagues and students to NOT write
>>> simple crawler scripts for the billion triples challenge or whatsoever
>>> without familiarizing themselves with the state of the art in "friendly
>>> crawling".
>>>
>>> Best wishes
>>>
>>> Martin Hepp
>>>
>>> --
>>> Christopher Gutteridge -- http://id.ecs.soton.ac.uk/person/1248
>>>
>>> You should read the ECS Web Team blog: http://blogs.ecs.soton.ac.uk/webteam/

--
Regards,

Kingsley Idehen
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen
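On the "friendly crawling" Martin asks for in the quoted thread, a minimal sketch of a polite crawler is below: it honours robots.txt (including a Crawl-delay directive, if the site sets one), identifies itself with a contact address in the User-Agent, and fetches sequentially with a pause between requests. The site, user agent string, and seed list are placeholders, not real endpoints.

import time
import urllib.robotparser
import urllib.request

# Placeholder site and contact details -- substitute your own.
BASE = "http://example.org"
USER_AGENT = "ExampleUniCrawler/0.1 (+http://example.org/about-crawler; mailto:crawler-admin@example.org)"

robots = urllib.robotparser.RobotFileParser(BASE + "/robots.txt")
robots.read()
# Respect an explicit Crawl-delay (e.g. "Crawl-delay: 10" in robots.txt);
# otherwise default to a conservative pause between requests.
delay = robots.crawl_delay(USER_AGENT) or 10

def fetch(url):
    """Fetch one RDF document politely; return None if robots.txt disallows it."""
    if not robots.can_fetch(USER_AGENT, url):
        return None
    request = urllib.request.Request(url, headers={
        "User-Agent": USER_AGENT,          # identify yourself and how to reach you
        "Accept": "application/rdf+xml",
    })
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()

# Single-threaded main loop: one request at a time, with a pause in between.
for url in [BASE + "/id/example"]:         # placeholder seed list
    data = fetch(url)
    time.sleep(delay)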
Received on Wednesday, 22 June 2011 10:43:44 UTC