Re: Think before you write Semantic Web crawlers

Martin,

On 23.06.2011, at 10:30, Martin Hepp wrote:

> Sebastian, all:
> The community may not publicly admit it, but: SW and LOD have been BEGGING for adoption for almost a decade. Now, if someone outside of a University project publishes valuable RDF data in a well-above-the-standards way, you make him pay several hundred Euros for traffic just for your ISWC paper.

I am very well aware of the problem of adoption. At the same time, we have a similar problem not only in the publication of the data but also in its consumption: if we do not let users consume our data, even at large scale, what use is the data at all? I agree that bombarding a server with crawlers just to harvest as many triples as possible, without thinking about their use, is stupid. But it will happen anyway, no matter how many mails we send on the Linked Data mailing list.
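
Being thoughtful on the crawler side is not even much work. Just as an illustration (a rough Python sketch using only the standard library; the agent name, delay, and Accept header are made up for the example), respecting robots.txt and pausing between requests is a handful of lines:

import time
import urllib.robotparser
import urllib.request
from urllib.parse import urljoin, urlparse

USER_AGENT = "ExampleLDCrawler/0.1 (contact: admin@example.org)"  # hypothetical
CRAWL_DELAY = 10  # seconds between requests to the same host (illustrative)

def fetch_politely(urls):
    robots = {}        # cached robots.txt parser per host
    last_request = {}  # time of the last request per host
    for url in urls:
        host = urlparse(url).netloc
        rp = robots.get(host)
        if rp is None:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(urljoin(url, "/robots.txt"))
            try:
                rp.read()
            except OSError:
                pass  # robots.txt unreachable: can_fetch() stays conservative
            robots[host] = rp
        if not rp.can_fetch(USER_AGENT, url):
            continue  # the publisher asked us not to fetch this URL
        # simple request throttling: at most one request per host per delay
        wait = CRAWL_DELAY - (time.time() - last_request.get(host, 0))
        if wait > 0:
            time.sleep(wait)
        req = urllib.request.Request(url, headers={
            "User-Agent": USER_AGENT,
            "Accept": "application/rdf+xml, text/turtle",
        })
        with urllib.request.urlopen(req) as resp:
            data = resp.read()
        last_request[host] = time.time()
        yield url, data

But again: we cannot rely on every student project doing this before a paper deadline.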

My argument is that even the useful applications that will be built on top of Linked Data will eventually make the data providers pay real money for publishing their data, in the same way that it costs money to publish a website on the eyeball Web. So what is the difference? Probably that people nowadays immediately see that the money spent on a website is well spent, while they do not yet see that the money invested in Linked Data is, because there is a lack of compelling applications and of actual data use. And there is the vicious circle: how are people going to build compelling applications if they have no access to the data?

Btw, just for the record: for my ISWC paper I did not harvest the Web for RDF; instead we wrote a Linked Data server that might eventually contribute to solving the scalability problems and make it both easier and cheaper to publish Linked Data.

> 
> Quote from one e-mail I received: "I think we are among the Universities running such a crawler. Because we were in a rush for the ISWC deadline, nobody took the time to implement robots.txt and bandwidth throttling. Sorry."
> 
> Stop dreaming.

So stop doing research? ;-)

I am dreaming of providing users with technology that allows them to publish their data as Linked Data easily, without needing to care too much about the complex issues that come with it: scalability, authentication, and technical details like bandwidth throttling (which can just as well be implemented on the server side).
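
To make the server-side point concrete, here is a rough sketch of per-client throttling as WSGI middleware (Python, a simple token bucket; the rate, burst and 503 response are illustrative, locking is ignored for brevity, and in practice one would rather configure the same thing in the web server itself):

import time

class ThrottleMiddleware:
    """Allow at most `rate` requests per second per client IP."""

    def __init__(self, app, rate=2.0, burst=10):
        self.app = app
        self.rate = rate      # tokens added per second
        self.burst = burst    # maximum bucket size
        self.buckets = {}     # client IP -> (tokens, last_refill_time)

    def __call__(self, environ, start_response):
        ip = environ.get("REMOTE_ADDR", "unknown")
        tokens, last = self.buckets.get(ip, (self.burst, time.time()))
        now = time.time()
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens < 1.0:
            # over the limit: tell the client to slow down
            start_response("503 Service Unavailable",
                           [("Content-Type", "text/plain"), ("Retry-After", "1")])
            self.buckets[ip] = (tokens, now)
            return [b"Throttled: too many requests, please slow down.\n"]
        self.buckets[ip] = (tokens - 1.0, now)
        return self.app(environ, start_response)

The point is that none of this should end up on the plate of the data publisher.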

> A technical improvement for the WWW cannot be developed in isolation from the socio-economic environment. I.e., it will lead to nowhere to just work on technical solutions that don't fit the characteristics of the target eco-system, skill-wise, incentive-wise, or complexity-wise, and then wait for the world to pick it up. Unless you want your work to be listed here
> 
>    http://www.mclol.com/funny-articles/most-useless-inventions-ever/
> 
> WebID is a notable exception, because it takes into account exactly those dimensions.
> 
>> And what if, in the future, 100,000 software agents access the servers? We will face the scalability issue eventually even without crawlers, so let's try to solve it. In the eyeball Web there are also crawlers, without too much of a problem, and if Linked Data is to be successful we need to do the same.
> 
> How do you personally solve the scalability issue for small site-owners who are running a decent service from a basic understanding of HTML, PHP, and MySQL?


By providing them with technology that does for Linked Data what the Apache web server does for the eyeball Web, and which you did not even mention in your list because it is so obvious. Small site-owners simply should not need to care at all, because we give them the right technology that takes the current problems away. We are working on that in Salzburg, and many others are working on it as well.


Greetings,

Sebastian
-- 
| Dr. Sebastian Schaffert          sebastian.schaffert@salzburgresearch.at
| Salzburg Research Forschungsgesellschaft  http://www.salzburgresearch.at
| Head of Knowledge and Media Technologies Group          +43 662 2288 423
| Jakob-Haringer Strasse 5/II
| A-5020 Salzburg

Received on Thursday, 23 June 2011 11:32:57 UTC