- From: Paola Di Maio <paola.dimaio@gmail.com>
- Date: Wed, 22 Jun 2011 12:31:14 +0100
- To: semantic-web at W3C <semantic-web@w3.org>
- Message-ID: <BANLkTi=w9xN8TBaigsP0J8nnmqfmZeGS_Q@mail.gmail.com>
Martin, I am sorry to hear that - it sounds unfair. If anything, from what I understand, your effort is among the most rational in this field. We can at least make this a valuable lesson (learning from failure):

- could this have been foreseen?
- was any good practice ignored?
- would Squid or another inhibiting mechanism enable crawl control?
- what's the lesson, the guideline for the future?

This could become a classic textbook case for the future of the semantic web.

Despair not,
P

On Wed, Jun 22, 2011 at 10:42 AM, Martin Hepp <martin.hepp@ebusiness-unibw.org> wrote:

> Just to inform the community that the BTC / research crawlers have been
> successful in killing a major RDF source for e-commerce:
>
> OpenEAN - a transcript of more than 1 million product models and their
> EAN/UPC codes at http://openean.kaufkauf.net/id/ - has been permanently shut
> down by the site operator, because fighting with bad semweb crawlers was
> taking too much of his time.
>
> Thanks a lot to everybody who contributed to that. It trashes a month of
> work and many million useful triples.
>
> Best
>
> Martin Hepp
>
> On Jun 22, 2011, at 11:37 AM, Yves Raimond wrote:
>
> > Hello!
> >
> >> The difference between these two scenarios is that there's almost no CPU
> >> involvement in serving the PDF file, but naive RDF sites use lots of
> >> cycles to generate the response to a query for an RDF document.
> >>
> >> Right now queries to data.southampton.ac.uk (e.g.
> >> http://data.southampton.ac.uk/products-and-services/CupCake.rdf ) are
> >> answered live, but this is not efficient. My colleague, Dave Challis, has
> >> prepared a SPARQL endpoint which caches results, which we can turn on if
> >> the load gets too high; that should at least mitigate the problem. Very
> >> few datasets change in a 24-hour period.
> >
> > Hmm, I would strongly argue that this is not the case (and stale datasets
> > are a big issue in LOD, imho!). The data on the BBC website, for example,
> > changes approximately 10 times a second.
> >
> > We've also been hit in the past (and still now, to a lesser extent) by
> > badly behaving crawlers. I agree that, as we don't provide dumps, crawling
> > is the only way to generate an aggregation of BBC data, but we've had
> > downtime in the past caused by crawlers. After that happened, there was a
> > lot of discussion about whether we should publish RDF data at all
> > (thankfully, we succeeded in arguing that we should keep it - but that's a
> > lot of time spent arguing instead of publishing new juicy RDF data!)
> >
> > I also want to point out (in response to Andreas's email) that HTTP caches
> > are *completely* ineffective at protecting a dataset against this, as
> > crawlers tend to be exhaustive. ETags and Expires headers are helpful, but
> > chances are that 1) you don't know when the data will change - you can
> > only make a wild guess based on previous behavior - and 2) the cache will
> > have expired by the time the crawler requests a document a second time,
> > as it has ~100M (in our case) documents to crawl through.
> >
> > Request throttling would work, but you would have to find a way to
> > identify crawlers, which is tricky: most of them use multiple IPs and
> > don't set appropriate user agents (the crawlers that currently hit us the
> > most are wget and Java 1.6 :/ ).
> >
> > So overall, there is no excuse for badly behaving crawlers!
> >
> > Cheers,
> > y
> >
> >> Martin Hepp wrote:
> >>
> >> Hi Daniel,
> >> Thanks for the link! I will relay this to relevant site-owners.
> >>
> >> However, I still challenge Andreas' statement that the site-owners are to
> >> blame for publishing large amounts of data on small servers.
> >>
> >> One can publish 10,000 PDF documents on a tiny server without being hit
> >> by DoS-style crazy crawlers. Why should the same not hold if I publish
> >> RDF?
> >>
> >> But for sure, it is necessary to advise all publishers of large RDF
> >> datasets to protect themselves against hungry crawlers and actual DoS
> >> attacks.
> >>
> >> Imagine if a large site were brought down by a botnet exploiting Semantic
> >> Sitemap information for DoS attacks, focussing on the large dump files.
> >> This could end LOD experiments for that site.
> >>
> >> Best
> >>
> >> Martin
> >>
> >> On Jun 21, 2011, at 10:24 AM, Daniel Herzig wrote:
> >>
> >> Hi Martin,
> >>
> >> Have you tried putting Squid [1] as a reverse proxy in front of your
> >> servers and using delay pools [2] to catch hungry crawlers?
> >>
> >> Cheers,
> >> Daniel
> >>
> >> [1] http://www.squid-cache.org/
> >> [2] http://wiki.squid-cache.org/Features/DelayPools
> >>
> >> On 21.06.2011, at 09:49, Martin Hepp wrote:
> >>
> >> Hi all:
> >>
> >> For the third time in a few weeks, we have had massive complaints from
> >> site-owners that Semantic Web crawlers from universities visited their
> >> sites in a way close to a denial-of-service attack, i.e., crawling data
> >> at maximum bandwidth with a parallelized approach.
> >>
> >> It's clear that a single, stupidly written crawler script, run from a
> >> powerful university network, can quickly create a terrible traffic load.
> >>
> >> Many of the scripts we saw
> >>
> >> - ignored robots.txt,
> >> - ignored clear crawling-speed limitations in robots.txt,
> >> - did not identify themselves properly in the HTTP request header, or
> >>   lacked contact information therein,
> >> - used no mechanisms at all for limiting the default crawling speed and
> >>   re-crawling delays.
> >>
> >> This irresponsible behavior can be the final reason for site-owners to
> >> say farewell to academic/W3C-sponsored semantic technology.
> >>
> >> So please, please - advise all of your colleagues and students NOT to
> >> write simple crawler scripts for the Billion Triples Challenge or
> >> anything else without familiarizing themselves with the state of the art
> >> in "friendly crawling".
> >>
> >> Best wishes
> >>
> >> Martin Hepp
> >>
> >> --
> >> Christopher Gutteridge -- http://id.ecs.soton.ac.uk/person/1248
> >>
> >> You should read the ECS Web Team blog: http://blogs.ecs.soton.ac.uk/webteam/
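As an illustration of the "friendly crawling" Martin asks for, here is a minimal sketch of a polite crawler, assuming Python 3.6+. It is not the code of any crawler discussed in this thread, and the bot name, contact address, and start URL are placeholders. It checks robots.txt (including any Crawl-delay directive), identifies itself with contact details in the User-Agent header, reuses ETags for conditional requests (the mechanism Yves mentions), and fetches one document at a time with a pause in between.

```python
# Minimal sketch of a "friendly" crawler; names and URLs below are placeholders.
import time
import urllib.error
import urllib.request
import urllib.robotparser

BASE = "http://example.org"                      # placeholder dataset host
USER_AGENT = ("ExampleResearchBot/0.1 "
              "(+http://example.org/about-this-crawler; mailto:crawler-admin@example.org)")

# Honour robots.txt, including any Crawl-delay directive.
robots = urllib.robotparser.RobotFileParser(BASE + "/robots.txt")
robots.read()
delay = robots.crawl_delay(USER_AGENT) or 5      # seconds between requests; conservative default

etags = {}                                       # remembered ETags for conditional re-crawls

def fetch(url):
    """Fetch one document politely; return None if disallowed or unchanged."""
    if not robots.can_fetch(USER_AGENT, url):
        return None                              # robots.txt says hands off
    headers = {"User-Agent": USER_AGENT, "Accept": "application/rdf+xml"}
    if url in etags:
        headers["If-None-Match"] = etags[url]    # lets the server answer 304 cheaply
    body = None
    try:
        request = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(request, timeout=30) as response:
            if response.headers.get("ETag"):
                etags[url] = response.headers["ETag"]
            body = response.read()
    except urllib.error.HTTPError as err:
        if err.code != 304:                      # 304 = unchanged since the last crawl
            raise
    finally:
        time.sleep(delay)                        # sequential and throttled: no parallel hammering
    return body
```

For documents that have not changed, a re-crawl along these lines costs the publisher little more than a 304 response, instead of regenerating the RDF document on every pass.

On the publisher side, Daniel's suggestion is Squid with delay pools; the sketch below shows the same idea in miniature, assuming a Python/WSGI stack, as a hand-rolled middleware that turns away over-eager clients before any expensive RDF generation happens. The one-request-per-second threshold and the demo application are made up for illustration.

```python
# Minimal sketch of server-side request throttling as WSGI middleware.
import time
from collections import defaultdict
from wsgiref.simple_server import make_server

class Throttle:
    """Turn away clients that request documents faster than min_interval seconds apart."""

    def __init__(self, app, min_interval=1.0):
        self.app = app
        self.min_interval = min_interval
        self.last_request = defaultdict(float)   # client IP -> time of last request

    def __call__(self, environ, start_response):
        ip = environ.get("REMOTE_ADDR", "unknown")
        now = time.time()
        if now - self.last_request[ip] < self.min_interval:
            # Ask the impatient client to back off instead of burning CPU on RDF generation.
            start_response("503 Service Unavailable",
                           [("Content-Type", "text/plain"), ("Retry-After", "10")])
            return [b"Please slow down: one request per second per client.\n"]
        self.last_request[ip] = now
        return self.app(environ, start_response)

def demo_app(environ, start_response):
    # Stand-in for the real (expensive) RDF-producing application.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Imagine an expensive RDF document here.\n"]

if __name__ == "__main__":
    make_server("", 8000, Throttle(demo_app)).serve_forever()
```

As Yves points out, per-IP throttling only catches crawlers coming from a single address; against crawlers spread over many IPs, a caching reverse proxy such as Squid with delay pools in front of the whole site is the more robust option.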
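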
Received on Wednesday, 22 June 2011 11:31:43 UTC