Re: Think before you write Semantic Web crawlers from Henry Story on 2011-06-22 (public-lod@w3.org from June 2011)

From: Henry Story <henry.story@bblfish.net>
Date: Wed, 22 Jun 2011 14:18:49 +0200
To: Lin Clark <lin.w.clark@gmail.com>
Cc: adam.saltiel@gmail.com, Martin Hepp <martin.hepp@ebusiness-unibw.org>, semantic-web-request@w3.org, Yves Raimond <yves.raimond@gmail.com>, Christopher Gutteridge <cjg@ecs.soton.ac.uk>, Daniel Herzig <herzig@kit.edu>, semantic-web@w3.org, public-lod@w3.org
Message-Id: <7EEA0AEC-6F74-42B9-A816-8DE6E9BE1741@bblfish.net>
On 22 Jun 2011, at 13:31, Lin Clark wrote:

> I was with you on this until the cathedral and the bazaar thing...

Yes, I think the metaphors there have ended up getting cross-wired.  The paper on the Cathedral and the Bazaar was a very good paper, written in a world that thought that only centralised ways of thinking could build good software. It really helped spread an old idea: peer-to-peer development.

Peer to peer is great, but it brings with it the potential for viruses and various other problems. That is why even in peer-to-peer software development, you allow people to fork a project, but not necessarily to write to your repository. The Web is peer to peer (the bazaar) but you don't allow everyone to write to your home page. One could think of the web as a number of cathedrals linked up in a peer-to-peer fashion. A bazaar of cathedrals if you wish.

It is this diversity of peers that makes the richness of the web. This diversity is guaranteed by the protection each site has from being attacked, and the guarantee therefore that each site expresses a unique point of view. 

Until recently crawlers were few and far between, because computing resources cost so much that only specialised engineers wrote them. At AltaVista the crawler was written initially by Louis Monier in 1996, then later by Spiderman. Spiderman was on the project for years, and his code was carefully tested and reviewed. The DEC Alpha machines of the time were 500 MHz 64-bit computers with 8 GB of RAM and cost a fortune. DEC was selling clusters of 8 of those together. You had to be very rich to get the bandwidth.

Now every laptop has 8 GB of RAM and 4 cores at 2.3 GHz, and every household has amazing bandwidth to the internet. So silly programs are going to become more prevalent. Relying on conventions such as robots.txt files placed at a conventional location, as described by some spec written out somewhere on the internet, is not going to work in this new world. Neither is it really going to help to look at HTTP headers and other such conventional methods.
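
For anyone who has not looked at it, the convention is simple enough for a well-behaved crawler to follow, as this minimal Python sketch suggests. The robots.txt content, the ExampleBot name and the example.org URLs are made up for illustration; the parser is the one in the Python standard library. The point above still stands: nothing forces a crawler to run a check like this.

```python
import urllib.robotparser

# Hypothetical robots.txt content; a real crawler would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""


def crawl_policy(robots_txt: str, user_agent: str, url: str):
    """Return (allowed, delay_seconds) for one URL under the given robots.txt."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = parser.can_fetch(user_agent, url)
    # crawl_delay() returns None when the file gives no Crawl-delay.
    delay = parser.crawl_delay(user_agent)
    return allowed, delay


if __name__ == "__main__":
    for url in ("http://example.org/id/1", "http://example.org/private/x"):
        print(url, crawl_policy(ROBOTS_TXT, "ExampleBot", url))
```

A polite crawler would skip any URL that comes back as not allowed and sleep for the returned delay between requests to the same host.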

You need to move to strong defences. This is what WebID provides very efficiently. Each resource can ask the requestor for their identity before giving access to a resource. It is completely decentralised and about as efficient as one can get.
Just as computing power has grown to the point where anyone can write silly software, so TLS and HTTPS have become cheaper and cheaper. Google is now moving to put all its servers behind HTTPS and so is Facebook. Soon all the web will be behind HTTPS, and that will massively increase security on the whole web.
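
As a rough illustration of what a server can do once it knows who is asking, here is a sketch of per-identity throttling in Python. It is not the WebID protocol itself: it assumes the TLS layer has already verified the client certificate and handed over the requestor's WebID URI as a string (or None for an anonymous client). The class name, its parameters and the example URI are all hypothetical.

```python
import time


class IdentityThrottle:
    """Token bucket per identified requestor (identity assumed verified upstream)."""

    def __init__(self, rate_per_sec: float, burst: int, clock=time.monotonic):
        self.rate = rate_per_sec   # tokens added per second
        self.burst = burst         # maximum tokens a requestor can accumulate
        self.clock = clock         # injectable clock, handy for testing
        self.buckets = {}          # identity -> (tokens, last_timestamp)

    def allow(self, identity):
        """Return True if this requestor may be served now."""
        if identity is None:
            return False  # in this sketch, anonymous requestors are refused
        now = self.clock()
        tokens, last = self.buckets.get(identity, (float(self.burst), now))
        # Refill according to elapsed time, capped at the burst size.
        tokens = min(float(self.burst), tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.buckets[identity] = (tokens - 1.0, now)
            return True
        self.buckets[identity] = (tokens, now)
        return False


if __name__ == "__main__":
    throttle = IdentityThrottle(rate_per_sec=1.0, burst=2)
    crawler = "https://example.org/crawler#me"
    print([throttle.allow(crawler) for _ in range(3)])
```

Because the bucket is keyed on the identity rather than on the IP address or user agent, a crawler spreading its requests over many machines still draws from a single allowance.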

Henry

Many papers and implementations of WebID are here http://esw.w3.org/foaf+ssl
Also please join the WebID incubator group at the W3C http://www.w3.org/2005/Incubator/webid/charter

> I think it is a serious misreading of cathedral and bazaar to think that if something is naive and irresponsible, it is by definition bazaar style development. Bazaar style is about how code is developed (in the open by a loosely organized and fluctuating group of developers). Cathedral means that it is a smaller, generally hierarchically organized group which doesn't work in a public, open way between releases. There is a good summary on wikipedia, http://en.wikipedia.org/wiki/The_Cathedral_and_the_Bazaar
> 
> The bazaar style of development can lead to things that are more responsible than their cathedral counterparts. Bazaar means continuously documenting your decisions in public, posting patches for testing and review by everyone (not just your extremely busy team mates), and opening your dev process to co-developers who you don't already know. These organizational strategies have led to some REALLY BIG engineering wins... and these engineering wins have resulted in more responsible products than their cathedral-built counterparts.
> 
> I also would question the assertion that people want cathedrals... the general direction on the Web seems to be away from cathedrals like Microsoft and Flash and towards bazaar developed solutions.
> 
> However, the call to responsibility is still a very valid one. I'm quite sorry to hear that a large data publisher has been pushed out of the community effort by people who should be working on the same team. 
> 
> -Lin
> 
> 
> On Wed, Jun 22, 2011 at 11:59 AM, <adam.saltiel@gmail.com> wrote:
> Yes. But are there things such as Squid and WebID that can be instituted on the provider side? This is an interesting moment. Is it the academic SemWeb running out of public-facing steam? A retreat? Or is it a moment of transition from naivety to responsibility? Think about the Cathedral and the Bazaar: there is a reason why people want Cathedrals. I suggest SemWeb is about Cathedrals. Responsibility for some order and structure.
> 
> Adam
> Sent using BlackBerry® from Orange
> 
> -----Original Message-----
> From: Martin Hepp <martin.hepp@ebusiness-unibw.org>
> Sender: semantic-web-request@w3.org
> Date: Wed, 22 Jun 2011 11:42:58
> To: Yves Raimond<yves.raimond@gmail.com>
> Cc: Christopher Gutteridge<cjg@ecs.soton.ac.uk>; Daniel Herzig<herzig@kit.edu>; <semantic-web@w3.org>; <public-lod@w3.org>
> Subject: Re: Think before you write Semantic Web crawlers
> 
> Just to inform the community that the BTC / research crawlers have been successful in killing a major RDF source for e-commerce:
> 
> OpenEAN - a transcript of >1 Mio product models and their EAN/UPC code at http://openean.kaufkauf.net/id/ has been permanently shut down by the site operator because fighting with bad semweb crawlers is taking too much of his time.
> 
> Thanks a lot to everybody who contributed to that. It trashes a month of work and many million useful triples.
> 
> Best
> 
> Martin Hepp
> 
> 
> 
> On Jun 22, 2011, at 11:37 AM, Yves Raimond wrote:
> 
> > Hello!
> >
> >> The difference between these two scenarios is that there's almost no CPU
> >> involvement in serving the PDF file, but naive RDF sites use lots of cycles
> >> to generate the response to a query for an RDF document.
> >>
> >> Right now queries to data.southampton.ac.uk (eg.
> >> http://data.southampton.ac.uk/products-and-services/CupCake.rdf ) are made
> >> live, but this is not efficient. My colleague, Dave Challis, has prepared a
> >> SPARQL endpoint which caches results which we can turn on if the load gets
> >> too high, which should at least mitigate the problem. Very few datasets
> >> change in a 24 hours period.
> >
> > Hmm, I would strongly argue it is not the case (and stale datasets are
> > a big issue in LOD imho!). The data on the BBC website, for example,
> > changes approximately 10 times a second.
> >
> > We've also been hit in the past (and still now, to a lesser extent) by
> > badly behaving crawlers. I agree that, as we don't provide dumps, it
> > is the only way to generate an aggregation of BBC data, but we've had
> > downtime in the past caused by crawlers. After that happened, it
> > caused lots of discussions on whether we should publish RDF data at
> > all (thankfully, we succeeded in arguing that we should keep it - but
> > that's a lot of time spent arguing instead of publishing new juicy RDF
> > data!)
> >
> > I also want to point out (in response to Andreas's email) that HTTP
> > caches are *completely* inefficient to protect a dataset against that,
> > as crawlers tend to be exhaustive. ETags and Expiry headers are
> > helpful, but chances are that 1) you don't know when the data will
> > change, you can just make a wild guess based on previous behavior 2)
> > the cache would have expired by the time the crawler requests a document
> > a second time, as it has ~100M (in our case) documents to crawl
> > through.
> >
> > Request throttling would work, but you would have to find a way to
> > identify crawlers, which is tricky: most of them use multiple IPs and
> > don't set appropriate user agents (the crawlers that currently hit us
> > the most are wget and Java 1.6 :/ ).
> >
> > So overall, there is no excuse for badly behaving crawlers!
> >
> > Cheers,
> > y
> >
> >>
> >> Martin Hepp wrote:
> >>
> >> Hi Daniel,
> >> Thanks for the link! I will relay this to relevant site-owners.
> >>
> >> However, I still challenge Andreas' statement that the site-owners are to
> >> blame for publishing large amounts of data on small servers.
> >>
> >> One can publish 10,000 PDF documents on a tiny server without being hit by
> >> DoS-style crazy crawlers. Why should the same not hold if I publish RDF?
> >>
> >> But for sure, it is necessary to advise all publishers of large RDF datasets
> >> to protect themselves against hungry crawlers and actual DoS attacks.
> >>
> >> Imagine if a large site was brought down by a botnet that is exploiting
> >> Semantic Sitemap information for DoS attacks, focussing on the large dump
> >> files.
> >> This could end LOD experiments for that site.
> >>
> >>
> >> Best
> >>
> >> Martin
> >>
> >>
> >> On Jun 21, 2011, at 10:24 AM, Daniel Herzig wrote:
> >>
> >>
> >>
> >> Hi Martin,
> >>
> >> Have you tried to put a Squid [1]  as reverse proxy in front of your servers
> >> and use delay pools [2] to catch hungry crawlers?
> >>
> >> Cheers,
> >> Daniel
> >>
> >> [1] http://www.squid-cache.org/
> >> [2] http://wiki.squid-cache.org/Features/DelayPools
> >>
> >> On 21.06.2011, at 09:49, Martin Hepp wrote:
> >>
> >>
> >>
> >> Hi all:
> >>
> >> For the third time in a few weeks, we had massive complaints from
> >> site-owners that Semantic Web crawlers from Universities visited their sites
> >> in a way close to a denial-of-service attack, i.e., crawling data with
> >> maximum bandwidth in a parallelized approach.
> >>
> >> It's clear that a single, stupidly written crawler script, run from a
> >> powerful University network, can quickly create terrible traffic load.
> >>
> >> Many of the scripts we saw
> >>
> >> - ignored robots.txt,
> >> - ignored clear crawling speed limitations in robots.txt,
> >> - did not identify themselves properly in the HTTP request header or lacked
> >> contact information therein,
> >> - used no mechanisms at all for limiting the default crawling speed and
> >> re-crawling delays.
> >>
> >> This irresponsible behavior can be the final reason for site-owners to say
> >> farewell to academic/W3C-sponsored semantic technology.
> >>
> >> So please, please - advise all of your colleagues and students to NOT write
> >> simple crawler scripts for the billion triples challenge or whatsoever
> >> without familiarizing themselves with the state of the art in "friendly
> >> crawling".
> >>
> >> Best wishes
> >>
> >> Martin Hepp
> >>
> >>
> >>
> >>
> >>
> >> --
> >> Christopher Gutteridge -- http://id.ecs.soton.ac.uk/person/1248
> >>
> >> You should read the ECS Web Team blog: http://blogs.ecs.soton.ac.uk/webteam/
> >>
> 
> 
> 
> 
> 
> -- 
> Lin Clark
> DERI, NUI Galway
> 
> lin-clark.com
> twitter.com/linclark
> 

Social Web Architect
http://bblfish.net/
Received on Wednesday, 22 June 2011 12:19:36 UTC