Fwd: Think before you write Semantic Web crawlers

Martin

I am sorry to hear that; it sounds unfair.
If anything, from what I understand, your effort is among the most rational in
this field.

We can at least draw a valuable lesson from this (learning from failure):
could this have been foreseen?
was any good practice ignored?
would Squid or another throttling mechanism have enabled crawl control?
what's the lesson, the guideline for the future?

could become a classic textbook case for the future Semantic Web

despair not


P

On Wed, Jun 22, 2011 at 10:42 AM, Martin Hepp <
martin.hepp@ebusiness-unibw.org> wrote:

> Just to inform the community that the BTC / research crawlers have been
> successful in killing a major RDF source for e-commerce:
>
> OpenEAN - a dataset of more than 1 million product models and their EAN/UPC
> codes at http://openean.kaufkauf.net/id/ - has been permanently shut down by
> the site operator because fighting off bad semweb crawlers is taking too much
> of his time.
>
> Thanks a lot to everybody who contributed to that. It trashes a month of
> work and many millions of useful triples.
>
> Best
>
> Martin Hepp
>
>
>
> On Jun 22, 2011, at 11:37 AM, Yves Raimond wrote:
>
> > Hello!
> >
> >> The difference between these two scenarios is that there's almost no CPU
> >> involvement in serving the PDF file, but naive RDF sites use lots of cycles
> >> to generate the response to a query for an RDF document.
> >>
> >> Right now queries to data.southampton.ac.uk (e.g.
> >> http://data.southampton.ac.uk/products-and-services/CupCake.rdf ) are made
> >> live, but this is not efficient. My colleague, Dave Challis, has prepared a
> >> SPARQL endpoint which caches results, which we can turn on if the load gets
> >> too high; that should at least mitigate the problem. Very few datasets
> >> change in a 24-hour period.
> >
> > Hmm, I would strongly argue that this is not the case (and stale datasets
> > are a big issue in LOD, imho!). The data on the BBC website, for example,
> > changes approximately 10 times a second.
> >
> > We've also been hit in the past (and still now, to a lesser extent) by
> > badly behaving crawlers. I agree that, as we don't provide dumps, crawling
> > is the only way to generate an aggregation of BBC data, but we've had
> > downtime in the past caused by crawlers. After that happened, there were
> > lots of discussions about whether we should publish RDF data at all
> > (thankfully, we succeeded in arguing that we should keep it - but that's a
> > lot of time spent arguing instead of publishing new juicy RDF data!)
> >
> > I also want to point out (in response to Andreas's email) that HTTP
> > caches are *completely* ineffective at protecting a dataset against that,
> > as crawlers tend to be exhaustive. ETags and Expires headers are
> > helpful, but chances are that 1) you don't know when the data will
> > change, you can only make a wild guess based on previous behaviour, and 2)
> > the cache will have expired by the time the crawler requests a document
> > a second time, as it has ~100M (in our case) documents to crawl
> > through.
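> >
> > (Purely as an illustration of the mechanism under discussion - the
> > conditional GET that an ETag enables on the crawler side; the URL is a
> > placeholder and this is only a sketch, not something we actually run:)
> >
> >   import urllib.request
> >   import urllib.error
> >
> >   url = "http://example.org/doc.rdf"     # placeholder document URL
> >
> >   # First fetch: remember the ETag the server hands back.
> >   with urllib.request.urlopen(url) as resp:
> >       etag = resp.headers.get("ETag")
> >       body = resp.read()
> >
> >   # Later re-crawl: ask whether the document actually changed.
> >   headers = {"If-None-Match": etag} if etag else {}
> >   req = urllib.request.Request(url, headers=headers)
> >   try:
> >       with urllib.request.urlopen(req) as resp:
> >           body = resp.read()              # 200: changed, take new content
> >   except urllib.error.HTTPError as e:
> >       if e.code != 304:                   # 304 Not Modified: reuse cached copy
> >           raise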
> >
> > Request throttling would work, but you would have to find a way to
> > identify crawlers, which is tricky: most of them use multiple IPs and
> > don't set appropriate user agents (the crawlers that currently hit us
> > the most are wget and Java 1.6 :/ ).
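> >
> > (Again only an illustrative sketch of what such throttling, keyed on the
> > client IP, could look like - the window and limit are made-up values, and
> > as said above the IP is at best a rough heuristic for spotting a crawler:)
> >
> >   import time
> >   from collections import defaultdict, deque
> >
> >   WINDOW = 60.0        # seconds (made-up value)
> >   MAX_REQUESTS = 30    # allowed per client per window (made-up value)
> >   history = defaultdict(deque)
> >
> >   def allow_request(client_ip):
> >       """False once a client exceeds MAX_REQUESTS within WINDOW seconds."""
> >       now = time.time()
> >       q = history[client_ip]
> >       while q and now - q[0] > WINDOW:
> >           q.popleft()                # forget requests outside the window
> >       if len(q) >= MAX_REQUESTS:
> >           return False               # caller should answer 503 with Retry-After
> >       q.append(now)
> >       return True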
> >
> > So overall, there is no excuse for badly behaving crawlers!
> >
> > Cheers,
> > y
> >
> >>
> >> Martin Hepp wrote:
> >>
> >> Hi Daniel,
> >> Thanks for the link! I will relay this to relevant site-owners.
> >>
> >> However, I still challenge Andreas' statement that the site-owners are to
> >> blame for publishing large amounts of data on small servers.
> >>
> >> One can publish 10,000 PDF documents on a tiny server without being hit by
> >> DoS-style crazy crawlers. Why should the same not hold if I publish RDF?
> >>
> >> But for sure, it is necessary to advise all publishers of large RDF datasets
> >> to protect themselves against hungry crawlers and actual DoS attacks.
> >>
> >> Imagine if a large site were brought down by a botnet exploiting
> >> Semantic Sitemap information for DoS attacks, focusing on the large dump
> >> files. This could end LOD experiments for that site.
> >>
> >>
> >> Best
> >>
> >> Martin
> >>
> >>
> >> On Jun 21, 2011, at 10:24 AM, Daniel Herzig wrote:
> >>
> >>
> >>
> >> Hi Martin,
> >>
> >> Have you tried to put a Squid [1] as a reverse proxy in front of your
> >> servers and use delay pools [2] to catch hungry crawlers?
> >>
> >> Cheers,
> >> Daniel
> >>
> >> [1] http://www.squid-cache.org/
> >> [2] http://wiki.squid-cache.org/Features/DelayPools
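> >>
> >> For what it's worth, a minimal delay-pool fragment could look roughly like
> >> this (hostnames, ports and byte limits are placeholders, not a tested
> >> configuration):
> >>
> >>   # squid.conf sketch - Squid as reverse proxy in front of the RDF server
> >>   http_port 80 accel defaultsite=example.org
> >>   cache_peer 127.0.0.1 parent 8080 0 no-query originserver
> >>
> >>   # One class-2 pool: an aggregate cap plus a per-client-IP bucket,
> >>   # so a single greedy crawler cannot monopolize the server.
> >>   acl all_clients src all
> >>   delay_pools 1
> >>   delay_class 1 2
> >>   delay_parameters 1 512000/512000 32000/64000
> >>   delay_access 1 allow all_clients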
> >>
> >> On 21.06.2011, at 09:49, Martin Hepp wrote:
> >>
> >>
> >>
> >> Hi all:
> >>
> >> For the third time in a few weeks, we had massive complaints from
> >> site-owners that Semantic Web crawlers from Universities visited their sites
> >> in a way close to a denial-of-service attack, i.e., crawling data with
> >> maximum bandwidth in a parallelized approach.
> >>
> >> It's clear that a single, stupidly written crawler script, run from a
> >> powerful University network, can quickly create terrible traffic load.
> >>
> >> Many of the scripts we saw
> >>
> >> - ignored robots.txt,
> >> - ignored clear crawling speed limitations in robots.txt (see the example
> >> below),
> >> - did not identify themselves properly in the HTTP request header or lacked
> >> contact information therein,
> >> - used no mechanisms at all for limiting the default crawling speed and
> >> re-crawling delays.
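> >>
> >> For reference, the kind of speed limitation meant above can be stated in a
> >> robots.txt as simple as this (path and delay are placeholders; Crawl-delay
> >> is a de-facto extension that only some crawlers honour):
> >>
> >>   User-agent: *
> >>   Crawl-delay: 10      # at most one request every 10 seconds
> >>   Disallow: /sparql    # keep bulk crawlers away from expensive endpoints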
> >>
> >> This irresponsible behavior can be the final reason for site-owners to say
> >> farewell to academic/W3C-sponsored semantic technology.
> >>
> >> So please, please - advise all of your colleagues and students NOT to write
> >> simple crawler scripts for the Billion Triples Challenge or anything else
> >> without familiarizing themselves with the state of the art in "friendly
> >> crawling".
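> >>
> >> As a rough sketch of what "friendly crawling" means in practice (the URLs,
> >> contact address and default delay below are placeholders): respect
> >> robots.txt, identify yourself with contact information, and fetch
> >> sequentially with a delay:
> >>
> >>   import time
> >>   import urllib.request
> >>   import urllib.robotparser
> >>
> >>   USER_AGENT = "ExampleResearchBot/0.1 (mailto:crawler-admin@example.org)"
> >>   DEFAULT_DELAY = 5.0   # seconds between requests (placeholder)
> >>
> >>   rp = urllib.robotparser.RobotFileParser("http://example.org/robots.txt")
> >>   rp.read()
> >>   delay = rp.crawl_delay(USER_AGENT) or DEFAULT_DELAY
> >>
> >>   for url in ["http://example.org/id/1", "http://example.org/id/2"]:
> >>       if not rp.can_fetch(USER_AGENT, url):
> >>           continue                      # robots.txt says: do not fetch
> >>       req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
> >>       with urllib.request.urlopen(req) as resp:
> >>           data = resp.read()            # process the document here
> >>       time.sleep(delay)                 # one request at a time, politely spaced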
> >>
> >> Best wishes
> >>
> >> Martin Hepp
> >>
> >>
> >>
> >>
> >>
> >> --
> >> Christopher Gutteridge -- http://id.ecs.soton.ac.uk/person/1248
> >>
> >> You should read the ECS Web Team blog:
> >> http://blogs.ecs.soton.ac.uk/webteam/
> >>
>
>
>

Received on Wednesday, 22 June 2011 11:31:43 UTC