Re: Think before you write Semantic Web crawlers

I was with you on this until the cathedral and the bazaar thing... I think
it is a serious misreading of The Cathedral and the Bazaar to think that if
something is naive and irresponsible, it is by definition bazaar-style
development. Bazaar style is about how code is developed (in the open, by a
loosely organized and fluctuating group of developers). Cathedral means a
smaller, generally hierarchically organized group which doesn't work in a
public, open way between releases. There is a good summary on Wikipedia:
http://en.wikipedia.org/wiki/The_Cathedral_and_the_Bazaar

The bazaar style of development can lead to things that are more responsible
than their cathedral counterparts. Bazaar means continuously documenting
your decisions in public, posting patches for testing and review by
everyone (not just your extremely busy teammates), and opening your dev
process to co-developers you don't already know. These organizational
strategies have led to some REALLY BIG engineering wins... and those
engineering wins have resulted in more responsible products than their
cathedral-built counterparts.

I also would question the assertion that people want cathedrals... the
general direction on the Web seems to be away from cathedrals like Microsoft
and Flash and towards bazaar-developed solutions.

However, the call to responsibility is still a very valid one. I'm quite
sorry to hear that a large data publisher has been pushed out of the
community effort by people who should be working on the same team.

-Lin


On Wed, Jun 22, 2011 at 11:59 AM, <adam.saltiel@gmail.com> wrote:

> Yes. But are there things such as Squid and WebID that can be instituted on
> the provider side? This is an interesting moment. Is the academic SemWeb
> running out of public-facing steam? A retreat? Or is it a moment of
> transition from naivety to responsibility? When we think about the Cathedral
> and the Bazaar, there is a reason why people want cathedrals. I suggest the
> SemWeb is about cathedrals: responsibility for some order and structure.
>
> Adam
> Sent using BlackBerry® from Orange
>
> -----Original Message-----
> From: Martin Hepp <martin.hepp@ebusiness-unibw.org>
> Sender: semantic-web-request@w3.org
> Date: Wed, 22 Jun 2011 11:42:58
> To: Yves Raimond<yves.raimond@gmail.com>
> Cc: Christopher Gutteridge<cjg@ecs.soton.ac.uk>; Daniel Herzig<
> herzig@kit.edu>; <semantic-web@w3.org>; <public-lod@w3.org>
> Subject: Re: Think before you write Semantic Web crawlers
>
> Just to inform the community that the BTC / research crawlers have been
> successful in killing a major RDF source for e-commerce:
>
> OpenEAN - a transcript of more than 1 million product models and their
> EAN/UPC codes at http://openean.kaufkauf.net/id/ - has been permanently shut
> down by the site operator, because fighting with badly behaving semweb
> crawlers was taking too much of his time.
>
> Thanks a lot to everybody who contributed to that. It trashes a month of
> work and many millions of useful triples.
>
> Best
>
> Martin Hepp
>
>
>
> On Jun 22, 2011, at 11:37 AM, Yves Raimond wrote:
>
> > Hello!
> >
> >> The difference between these two scenarios is that there's almost no CPU
> >> involvement in serving the PDF file, but naive RDF sites use lots of cycles
> >> to generate the response to a query for an RDF document.
> >>
> >> Right now queries to data.southampton.ac.uk (e.g.
> >> http://data.southampton.ac.uk/products-and-services/CupCake.rdf ) are made
> >> live, but this is not efficient. My colleague, Dave Challis, has prepared a
> >> SPARQL endpoint which caches results, which we can turn on if the load gets
> >> too high; that should at least mitigate the problem. Very few datasets
> >> change in a 24-hour period.
> >
> > Hmm, I would strongly argue that this is not the case (and stale datasets
> > are a big issue in LOD imho!). The data on the BBC website, for example,
> > changes approximately 10 times a second.
> >
> > We've also been hit in the past (and still now, to a lesser extent) by
> > badly behaving crawlers. I agree that, as we don't provide dumps, crawling
> > is the only way to generate an aggregation of BBC data, but we've had
> > downtime in the past caused by crawlers. After that happened, it triggered
> > lots of discussion on whether we should publish RDF data at all
> > (thankfully, we managed to argue that we should keep it - but that's a lot
> > of time spent arguing instead of publishing new juicy RDF data!)
> >
> > I also want to point out (in response to Andreas's email) that HTTP
> > caches are *completely* ineffective at protecting a dataset against this,
> > as crawlers tend to be exhaustive. ETags and Expires headers are
> > helpful, but chances are that 1) you don't know when the data will
> > change, so you can only make a wild guess based on previous behavior, and
> > 2) the cache will have expired by the time the crawler requests a document
> > a second time, as it has ~100M (in our case) documents to crawl
> > through.
> >
> > Request throttling would work, but you would have to find a way to
> > identify crawlers, which is tricky: most of them use multiple IPs and
> > don't set appropriate user agents (the crawlers that currently hit us
> > the hardest identify themselves as wget and Java 1.6 :/ ).
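> >
> > (Even without useful user agents, a quick look at the access log usually
> > shows who is hammering you. A rough sketch, assuming an Apache-style
> > access log - the file name and the top-20 cut-off are just placeholders:
> >
> >   # count requests per client IP and list the heaviest clients
> >   awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -20
> >
> > That at least tells you which addresses to throttle or block.)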
> >
> > So overall, there is no excuse for badly behaving crawlers!
> >
> > Cheers,
> > y
> >
> >>
> >> Martin Hepp wrote:
> >>
> >> Hi Daniel,
> >> Thanks for the link! I will relay this to relevant site-owners.
> >>
> >> However, I still challenge Andreas' statement that the site-owners are to
> >> blame for publishing large amounts of data on small servers.
> >>
> >> One can publish 10,000 PDF documents on a tiny server without being hit by
> >> DoS-style crazy crawlers. Why should the same not hold if I publish RDF?
> >>
> >> But for sure, it is necessary to advise all publishers of large RDF datasets
> >> to protect themselves against hungry crawlers and actual DoS attacks.
> >>
> >> Imagine if a large site was brought down by a botnet that is exploiting
> >> Semantic Sitemap information for DoS attacks, focussing on the large dump
> >> files.
> >> This could end LOD experiments for that site.
> >>
> >>
> >> Best
> >>
> >> Martin
> >>
> >>
> >> On Jun 21, 2011, at 10:24 AM, Daniel Herzig wrote:
> >>
> >>
> >>
> >> Hi Martin,
> >>
> >> Have you tried to put a Squid [1] as a reverse proxy in front of your
> >> servers and use delay pools [2] to catch hungry crawlers?
> >>
> >> Cheers,
> >> Daniel
> >>
> >> [1] http://www.squid-cache.org/
> >> [2] http://wiki.squid-cache.org/Features/DelayPools
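> >>
> >> For illustration, a minimal delay-pools sketch in squid.conf might look
> >> roughly like this (the numbers are made-up assumptions, not tuned values;
> >> delay class 2 gives an aggregate bucket plus a per-client-IP bucket):
> >>
> >>   # limit each client IP to ~8 KB/s after a 64 KB burst,
> >>   # while leaving the aggregate bandwidth unlimited
> >>   delay_pools 1
> >>   delay_class 1 2
> >>   delay_access 1 allow all
> >>   delay_parameters 1 -1/-1 8000/64000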
> >>
> >> On 21.06.2011, at 09:49, Martin Hepp wrote:
> >>
> >>
> >>
> >> Hi all:
> >>
> >> For the third time in a few weeks, we had massive complaints from
> >> site-owners that Semantic Web crawlers from Universities visited their sites
> >> in a way close to a denial-of-service attack, i.e., crawling data with
> >> maximum bandwidth in a parallelized approach.
> >>
> >> It's clear that a single, stupidly written crawler script, run from a
> >> powerful University network, can quickly create terrible traffic load.
> >>
> >> Many of the scripts we saw
> >>
> >> - ignored robots.txt,
> >> - ignored clear crawling speed limitations in robots.txt (see the example
> >> below),
> >> - did not identify themselves properly in the HTTP request header or lacked
> >> contact information therein,
> >> - used no mechanisms at all for limiting the default crawling speed and
> >> re-crawling delays.
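> >>
> >> For reference, the kind of speed limitation meant here is a simple
> >> robots.txt entry such as the following (a purely hypothetical example;
> >> Crawl-delay is non-standard but widely honoured by polite crawlers):
> >>
> >>   User-agent: *
> >>   # please wait at least 10 seconds between requests
> >>   Crawl-delay: 10
> >>   # keep crawlers away from expensive endpoints (hypothetical path)
> >>   Disallow: /sparql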
> >>
> >> This irresponsible behavior can be the final reason for site-owners to say
> >> farewell to academic/W3C-sponsored semantic technology.
> >>
> >> So please, please - advise all of your colleagues and students to NOT write
> >> simple crawler scripts for the billion triples challenge or anything else
> >> without familiarizing themselves with the state of the art in "friendly
> >> crawling".
> >>
> >> Best wishes
> >>
> >> Martin Hepp
> >>
> >>
> >>
> >>
> >>
> >> --
> >> Christopher Gutteridge -- http://id.ecs.soton.ac.uk/person/1248
> >>
> >> You should read the ECS Web Team blog: http://blogs.ecs.soton.ac.uk/webteam/
> >>
>
>
>


-- 
Lin Clark
DERI, NUI Galway <http://www.deri.ie/>

lin-clark.com
twitter.com/linclark

Received on Wednesday, 22 June 2011 11:32:28 UTC