Re: Think before you write Semantic Web crawlers from Martin Hepp on 2011-06-23 (public-lod@w3.org from June 2011)

From: Martin Hepp <martin.hepp@ebusiness-unibw.org>
Date: Thu, 23 Jun 2011 10:30:27 +0200
To: Sebastian Schaffert <sebastian.schaffert@salzburgresearch.at>
Cc: Lin Clark <lin.w.clark@gmail.com>, public-lod <public-lod@w3.org>
Message-Id: <B9711FC2-DF8F-48CE-B68E-6AFF34B15928@ebusiness-unibw.org>

Sebastian, all:
The community may not publicly admit it, but SW and LOD have been BEGGING for adoption for almost a decade. Now, if someone outside a university project publishes valuable RDF data in a well-above-the-standards way, you make him pay several hundred euros for traffic just for your ISWC paper.

Quote from one e-mail I received: "I think we are among the Universities running such a crawler. Because we were in a rush for the ISWC deadline, nobody took the time to implement robots.txt and bandwidth throttling. Sorry."
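
For illustration, here is a minimal sketch of what honoring robots.txt and throttling requests per host can look like, using Python's standard urllib.robotparser. The class name PoliteGate, the "demo-bot" user agent, the one-second default delay, and the policy of refusing hosts whose robots.txt has not been loaded are all assumptions made for this example; it is not taken from any crawler discussed in this thread.

```python
import time
import urllib.robotparser
from urllib.parse import urlparse


class PoliteGate:
    """Decides whether and when a crawler may request a URL (hypothetical helper)."""

    def __init__(self, user_agent, default_delay=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.user_agent = user_agent
        self.default_delay = default_delay  # assumed minimum pause per host, in seconds
        self._clock = clock
        self._sleep = sleep
        self._robots = {}    # host -> RobotFileParser
        self._last_hit = {}  # host -> time of the last permitted request

    def load_robots(self, host, robots_txt):
        """Register the robots.txt body that was fetched for a host."""
        parser = urllib.robotparser.RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self._robots[host] = parser

    def allowed(self, url):
        """True only if robots.txt for the host is known and permits the URL."""
        parser = self._robots.get(urlparse(url).netloc)
        # Assumed policy: without a loaded robots.txt, do not crawl at all.
        return parser is not None and parser.can_fetch(self.user_agent, url)

    def delay_for(self, host):
        """Use the larger of our default delay and the site's Crawl-delay."""
        parser = self._robots.get(host)
        crawl_delay = parser.crawl_delay(self.user_agent) if parser else None
        return max(self.default_delay, crawl_delay or 0)

    def wait_turn(self, url):
        """Block until enough time has passed since the last request to this host."""
        host = urlparse(url).netloc
        last = self._last_hit.get(host)
        if last is not None:
            remaining = self.delay_for(host) - (self._clock() - last)
            if remaining > 0:
                self._sleep(remaining)
        self._last_hit[host] = self._clock()
```

A crawler loop would call allowed(url) first and, only if it returns True, call wait_turn(url) before issuing the HTTP request. With a robots.txt that sets "Crawl-delay: 2", that caps the load at one request every two seconds per host, far below the 150 requests per second mentioned elsewhere in this thread.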

Stop dreaming. A technical improvement for the WWW cannot be developed in isolation from the socio-economic environment. That is, it leads nowhere to work only on technical solutions that don't fit the characteristics of the target ecosystem, skill-wise, incentive-wise, or complexity-wise, and then wait for the world to pick them up. Unless you want your work to be listed here:

    http://www.mclol.com/funny-articles/most-useless-inventions-ever/

WebID is a notable exception, because it takes into account exactly those dimensions.

> And what if in the future 100.000 software agents will access servers? We will have the scalability issue eventually even without crawlers, so let's try to solve it. In the eyeball web, there are also crawlers without too much of a problem, and if Linked Data is to be successful we need to do the same.

How do you personally solve the scalability issue for small site owners who run a decent service with only a basic understanding of HTML, PHP, and MySQL?

Best
Martin

On Jun 23, 2011, at 1:08 AM, Sebastian Schaffert wrote:

> 
> Am 22.06.2011 um 23:01 schrieb Lin Clark:
> 
>> On Wed, Jun 22, 2011 at 9:33 PM, Sebastian Schaffert <sebastian.schaffert@salzburgresearch.at> wrote:
>> 
>> Your complaint sounds to me a bit like "help, too many clients access my data".
>> 
>> I'm sure that Martin is really tired of saying this, so I will reiterate for him: It wasn't his data, they weren't his servers. He's speaking on behalf of people who aren't part of our insular community... people who don't have a compelling reason to subsidize a PhD student's Best Paper award with their own dollars and bandwidth.
> 
> And what about those companies subsidizing PhD students who write crawlers for the normal Web? Like Larry Page in 1998?
> 
>> 
>> Agents can use Linked Data just fine without firing 150 requests per second at a server. There are TONs of use cases that do not require that kind of server load.
> 
> And what if in the future 100.000 software agents will access servers? We will have the scalability issue eventually even without crawlers, so let's try to solve it. In the eyeball web, there are also crawlers without too much of a problem, and if Linked Data is to be successful we need to do the same.
> 
> Greetings,
> 
> Sebastian
> -- 
> | Dr. Sebastian Schaffert          sebastian.schaffert@salzburgresearch.at
> | Salzburg Research Forschungsgesellschaft  http://www.salzburgresearch.at
> | Head of Knowledge and Media Technologies Group          +43 662 2288 423
> | Jakob-Haringer Strasse 5/II
> | A-5020 Salzburg
>

Received on Thursday, 23 June 2011 08:30:54 UTC