Re: Think before you write Semantic Web crawlers

On 21 Jun 2011, at 11:44, Martin Hepp wrote:

> Hi Christopher, Henry, all:
> 
> The main problem is, imho:
> 
> 1. the basic attitude of Semantic Web research that the works done in the past or in other communities were irrelevant historical relicts (databases, middleware, EDI) and that the old fellows were simply too stupid to understand the "power of semantics that will make machines understand our data with ease", just by adding a bit of OWL 2 DL axioms, properly dereferencing data entity URIs according to their nice data publishing guidelines that turn toy examples into a magic art;
> 2. this implanted into the heads of eager young people who excelled in the "AI for freshmen", "complexity theory", and "theorem proving" exams and who now apply the gained self-confidence from a small subset of life to a broader range of fields, and
> 3. allocating a lot of money (EU funding) and an abundance of IT resources (university servers, bandwidth,...) to those folks.
> 
> This mindset is the petri dish for stupid crawlers as described.

Having worked at AltaVista, I can strongly confirm your point about the importance of writing polite crawlers. Search engines that are not polite can quickly find themselves blocked from large parts of the internet, and so lose their ability to do business.
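To make the point concrete, here is a minimal sketch of a polite fetch loop (Python, standard library only; the host, URLs and User-Agent string are made up for illustration). It reads robots.txt, honours any Crawl-delay, identifies itself with a contact address, and fetches sequentially rather than in parallel:

    import time
    import urllib.request
    from urllib.robotparser import RobotFileParser

    # Hypothetical crawler identity; a real crawler should name the project
    # and give a working contact address.
    USER_AGENT = "ExampleResearchBot/0.1 (+mailto:crawler-admin@example.org)"

    rp = RobotFileParser()
    rp.set_url("https://example.org/robots.txt")   # placeholder host
    rp.read()

    delay = rp.crawl_delay(USER_AGENT) or 1.0      # fall back to a modest default

    for url in ["https://example.org/data/1", "https://example.org/data/2"]:
        if not rp.can_fetch(USER_AGENT, url):
            continue                               # robots.txt says no
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req) as resp:
            body = resp.read()
        time.sleep(delay)                          # sequential, politely spaced requests

None of this is difficult; the trouble starts when it is skipped.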

> Unfortunately, authentication techniques won't help protect typical site-owners from the dangerous creatures written by Semantic Web researchers gathering data for the evaluation of their ISWC 2011 submission, because the site-owners at www.godaddy.com know nothing about WebID at this point ;-)

Yes, but if a few of the data set owners put up WebID-authenticated endpoints, then two things will happen:

1. The sites that do will be able to report on the crawling behaviour of individual crawlers, and you can use that evidence when reporting back to the funding boards of these research projects.
2. Large data sites will be able to reduce the bandwidth consumed by crawlers by guiding them to a dump. (Of course, one still wants to allow polite crawlers to download individual pages, so that they need not fetch the whole dump whenever they want to verify a small change; RSS feeds are useful here. See the sketch below.)
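From the crawler's side, the second point could look something like this (Python, with a hypothetical dump URL): fetch the dump once, then use a conditional GET on later visits, so that an unchanged dump costs the site almost nothing:

    import urllib.error
    import urllib.request

    USER_AGENT = "ExampleResearchBot/0.1 (+mailto:crawler-admin@example.org)"
    DUMP_URL = "https://example.org/dumps/data.ttl.gz"   # hypothetical dump location

    # Validators remembered from the previous download (None on the first run).
    previous_etag = None
    previous_last_modified = None

    headers = {"User-Agent": USER_AGENT}
    if previous_etag:
        headers["If-None-Match"] = previous_etag
    if previous_last_modified:
        headers["If-Modified-Since"] = previous_last_modified

    try:
        req = urllib.request.Request(DUMP_URL, headers=headers)
        with urllib.request.urlopen(req) as resp:
            data = resp.read()                           # dump changed: take the new copy
            previous_etag = resp.headers.get("ETag")
            previous_last_modified = resp.headers.get("Last-Modified")
    except urllib.error.HTTPError as e:
        if e.code != 304:                                # 304 Not Modified: keep the old copy
            raise

The same conditional-GET trick works for an RSS or Atom feed announcing which resources changed since the last visit.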

From my point of view this of course also helps the semweb community deploy WebID, a powerful authentication technique we can use to develop the most viral part of the semantic web to come: the social web. In the end every server will become a mini crawler, so there will potentially be a lot of this type of abuse. Where search engines could initially work with gentlemen's agreements on politeness, we are now at a stage, as you point out, where more and more people have access to crawler technologies or can easily write their own. As a result we need to focus on preventive technologies. WebID works out of the box with existing tech and ties right into linked data. With these tools at our disposal we don't have to rely on good will; we can enforce good behaviour.
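For the crawler side of WebID the change is small: the client presents an X.509 certificate whose SubjectAlternativeName URI points at its WebID profile, and the server can dereference that profile to find out who is knocking. A minimal sketch (Python standard library; the certificate file names and the endpoint URL are placeholders):

    import ssl
    import urllib.request

    # The certificate's SubjectAlternativeName URI points at the crawler's
    # WebID profile document; the file names here are placeholders.
    context = ssl.create_default_context()
    context.load_cert_chain(certfile="crawler-webid.crt", keyfile="crawler-webid.key")

    req = urllib.request.Request(
        "https://example.org/protected/data",            # hypothetical WebID-protected endpoint
        headers={"User-Agent": "ExampleResearchBot/0.1 (+mailto:crawler-admin@example.org)"},
    )
    with urllib.request.urlopen(req, context=context) as resp:
        print(resp.status, len(resp.read()))

On the server side the endpoint asks for the client certificate (it need not be signed by a CA), extracts the WebID URI from it, and checks the public key against the one published in the profile document.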


	Henry
  


> 
> Martin
> 
> PS: I will not release the IP ranges from which the trouble originated, but rest assured, there were top research institutions among them.
> 
> 
> On Jun 21, 2011, at 10:48 AM, Christopher Gutteridge wrote:
> 
>> Would some kind of caching crawler mitigate this issue? Have someone write a well-behaved crawler which allows you to download a recent .ttl.tgz of various sites. Of course, that assumes the student is able to find such a cache.
>> 
>> Asking people nicely will only work in a very small community.
>> 
>> Henry Story wrote:
>>> A solution to stupid crawlers would be to put the linked data behind https endpoints and use WebID
>>> for authentication. You could still allow everyone access, but at least you would force the crawler to identify
>>> itself, and you could use these WebIDs to learn who wrote the crawler. This could then be used as part of the evaluation of the quality of a Semantic Web stack.
>>> 
>>> Henry
>>> 
>>> 10 minute intro to WebID 
>>> http://bblfish.net/blog/2011/05/25/
>>> (in browsers, but the browser is not really necessary)
>>> 
>>> On 21 Jun 2011, at 09:49, Martin Hepp wrote:
>>> 
>>>> Hi all:
>>>> 
>>>> For the third time in a few weeks, we had massive complaints from site-owners that Semantic Web crawlers from Universities visited their sites in a way close to a denial-of-service attack, i.e., crawling data with maximum bandwidth in a parallelized approach.
>>>> 
>>>> It's clear that a single, stupidly written crawler script, run from a powerful University network, can quickly create terrible traffic load. 
>>>> 
>>>> Many of the scripts we saw
>>>> 
>>>> - ignored robots.txt,
>>>> - ignored clear crawling speed limitations in robots.txt,
>>>> - did not identify themselves properly in the HTTP request header or lacked contact information therein, 
>>>> - used no mechanisms at all for limiting the default crawling speed and re-crawling delays.
>>>> 
>>>> This irresponsible behavior can be the final reason for site-owners to say farewell to academic/W3C-sponsored semantic technology.
>>>> 
>>>> So please, please - advise all of your colleagues and students NOT to write simple crawler scripts for the Billion Triples Challenge or whatever without familiarizing themselves with the state of the art in "friendly crawling".
>>>> 
>>>> Best wishes
>>>> 
>>>> Martin Hepp
>>>> 
>>>> 
>>>> 
>>> 
>>> Social Web Architect
>>> 
>>> http://bblfish.net/
>>> 
>>> 
>> 
>> -- 
>> Christopher Gutteridge -- 
>> http://id.ecs.soton.ac.uk/person/1248
>> 
>> 
>> You should read the ECS Web Team blog: 
>> http://blogs.ecs.soton.ac.uk/webteam/
> 

Social Web Architect
http://bblfish.net/

Received on Tuesday, 21 June 2011 10:11:19 UTC