Re: Think before you write Semantic Web crawlers

On 6/22/11 10:42 AM, Martin Hepp wrote:
> Just to inform the community that the BTC / research crawlers have been successful in killing a major RDF source for e-commerce:
>
> OpenEAN - a transcript of >1 million product models and their EAN/UPC codes at http://openean.kaufkauf.net/id/ - has been permanently shut down by the site operator, because fighting off bad semweb crawlers was taking too much of his time.
>
> Thanks a lot to everybody who contributed to that. It trashes a month of work and many millions of useful triples.

Martin,

Is there a dump anywhere? Can they at least continue to produce RDF dumps?

We have some of their data (from prior dump loads) in our LOD cloud cache [1].

Links:

1. http://lod.openlinksw.com/describe/?url=http%3A%2F%2Fopenean.kaufkauf.net%2Fid%2F&urilookup=1


Kingsley
> Best
>
> Martin Hepp
>
>
>
> On Jun 22, 2011, at 11:37 AM, Yves Raimond wrote:
>
>> Hello!
>>
>>> The difference between these two scenarios is that there's almost no CPU
>>> involvement in serving the PDF file, but naive RDF sites use lots of cycles
>>> to generate the response to a query for an RDF document.
>>>
>>> Right now queries to data.southampton.ac.uk (e.g.
>>> http://data.southampton.ac.uk/products-and-services/CupCake.rdf ) are
>>> answered live, but this is not efficient. My colleague, Dave Challis, has
>>> prepared a SPARQL endpoint that caches results, which we can turn on if
>>> the load gets too high; that should at least mitigate the problem. Very
>>> few datasets change in a 24-hour period.
>> Hmm, I would strongly argue that this is not the case (and stale datasets
>> are a big issue in LOD, imho!). The data on the BBC website, for example,
>> changes approximately 10 times a second.
>>
>> We've also been hit in the past (and still are, to a lesser extent) by
>> badly behaving crawlers. I agree that, as we don't provide dumps, crawling
>> is the only way to build an aggregation of BBC data, but we've had downtime
>> in the past caused by crawlers. When that happened, it triggered a lot of
>> discussion about whether we should publish RDF data at all (thankfully, we
>> managed to argue that we should keep it - but that's a lot of time spent
>> arguing instead of publishing new, juicy RDF data!)
>>
>> I also want to point out (in response to Andreas's email) that HTTP
>> caches are *completely* ineffective at protecting a dataset against this,
>> as crawlers tend to be exhaustive. ETags and Expires headers are helpful,
>> but chances are that 1) you don't know when the data will change and can
>> only make a wild guess based on past behaviour, and 2) the cache entry
>> will have expired by the time the crawler requests a document a second
>> time, as it has ~100M documents (in our case) to crawl through.
>>
>> Request throttling would work, but you would have to find a way to
>> identify crawlers, which is tricky: most of them use multiple IPs and
>> don't set appropriate user agents (the crawlers that currently hit us
>> the most are wget and Java 1.6 :/ ).
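>>
>> (For what it's worth, even a crude per-IP limit in front of the endpoint
>> stops the single-machine offenders. A rough, untested nginx sketch - the
>> zone name, rate and backend address are all made up:
>>
>>    # inside the http{} block: track each client IP, allow ~2 requests/second
>>    limit_req_zone $binary_remote_addr zone=perip:10m rate=2r/s;
>>
>>    server {
>>        listen 80;
>>        location / {
>>            limit_req zone=perip burst=10;    # excess requests get a 503 by default
>>            proxy_pass http://127.0.0.1:8080; # placeholder for the real RDF backend
>>        }
>>    }
>>
>> It obviously won't catch a crawler spread across many IPs, but it is cheap.)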
>>
>> So overall, there is no excuse for badly behaving crawlers!
>>
>> Cheers,
>> y
>>
>>> Martin Hepp wrote:
>>>
>>> Hi Daniel,
>>> Thanks for the link! I will relay this to relevant site-owners.
>>>
>>> However, I still challenge Andreas' statement that the site-owners are to
>>> blame for publishing large amounts of data on small servers.
>>>
>>> One can publish 10,000 PDF documents on a tiny server without being hit by
>>> DoS-style crazy crawlers. Why should the same not hold if I publish RDF?
>>>
>>> But for sure, it is necessary to advise all publishers of large RDF datasets
>>> to protect themselves against hungry crawlers and actual DoS attacks.
>>>
>>> Imagine if a large site was brought down by a botnet that is exploiting
>>> Semantic Sitemap information for DoS attacks, focussing on the large dump
>>> files.
>>> This could end LOD experiments for that site.
>>>
>>>
>>> Best
>>>
>>> Martin
>>>
>>>
>>> On Jun 21, 2011, at 10:24 AM, Daniel Herzig wrote:
>>>
>>>
>>>
>>> Hi Martin,
>>>
>>> Have you tried to put a Squid [1]  as reverse proxy in front of your servers
>>> and use delay pools [2] to catch hungry crawlers?
>>>
>>> Cheers,
>>> Daniel
>>>
>>> [1] http://www.squid-cache.org/
>>> [2] http://wiki.squid-cache.org/Features/DelayPools
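>>>
>>> A rough squid.conf fragment for a class-2 pool (one aggregate bucket plus
>>> one bucket per client IP) could look like this - the numbers are only
>>> placeholders, tune them to your bandwidth:
>>>
>>>    # assumes Squid already runs as an accelerator / reverse proxy
>>>    delay_pools 1
>>>    delay_class 1 2
>>>    # aggregate unlimited; each client IP refilled at ~32 KB/s, 256 KB burst
>>>    delay_parameters 1 -1/-1 32000/256000
>>>    delay_access 1 allow all
>>>
>>> A greedy crawler then only slows itself down instead of taking the whole
>>> site out.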
>>>
>>> On 21.06.2011, at 09:49, Martin Hepp wrote:
>>>
>>>
>>>
>>> Hi all:
>>>
>>> For the third time in a few weeks, we have had massive complaints from
>>> site-owners that Semantic Web crawlers from Universities visited their
>>> sites in a way close to a denial-of-service attack, i.e., crawling data
>>> at maximum bandwidth with many parallel requests.
>>>
>>> It's clear that a single, stupidly written crawler script, run from a
>>> powerful University network, can quickly create terrible traffic load.
>>>
>>> Many of the scripts we saw
>>>
>>> - ignored robots.txt,
>>> - ignored clear crawling speed limitations in robots.txt,
>>> - did not identify themselves properly in the HTTP request header or lacked
>>> contact information therein,
>>> - used no mechanisms at all for limiting the default crawling speed and
>>> re-crawling delays.
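>>>
>>> For reference, the speed limitations being ignored are nothing exotic; a
>>> typical robots.txt might simply say (Crawl-delay is a de-facto extension
>>> recognised by many crawler libraries, not part of the original standard,
>>> and the path below is just an example):
>>>
>>>    User-agent: *
>>>    # at most one request every 10 seconds
>>>    Crawl-delay: 10
>>>    # keep expensive endpoints out of bulk crawls
>>>    Disallow: /sparql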
>>>
>>> This irresponsible behavior can be the final reason for site-owners to say
>>> farewell to academic/W3C-sponsored semantic technology.
>>>
>>> So please, please - advise all of your colleagues and students NOT to
>>> write simple crawler scripts for the Billion Triples Challenge or anything
>>> else without familiarizing themselves with the state of the art in
>>> "friendly crawling".
>>>
>>> Best wishes
>>>
>>> Martin Hepp
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Christopher Gutteridge -- http://id.ecs.soton.ac.uk/person/1248
>>>
>>> You should read the ECS Web Team blog: http://blogs.ecs.soton.ac.uk/webteam/
>>>
>
>


-- 

Regards,

Kingsley Idehen	
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen

Received on Wednesday, 22 June 2011 10:43:44 UTC