
Re: Think before you write Semantic Web crawlers

From: Henry Story <henry.story@bblfish.net>
Date: Wed, 22 Jun 2011 22:17:14 +0200
Cc: glenn mcdonald <glenn@furia.com>, Yves Raimond <yves.raimond@gmail.com>, Christopher Gutteridge <cjg@ecs.soton.ac.uk>, Daniel Herzig <herzig@kit.edu>, semantic-web@w3.org, public-lod@w3.org
Message-Id: <BA1C3031-A83E-4087-861A-B6421C49D44D@bblfish.net>
To: Martin Hepp <martin.hepp@ebusiness-unibw.org>

On 22 Jun 2011, at 21:05, Martin Hepp wrote:

> Glenn:
>> If there isn't, why not? We're the Semantic Web, dammit. If we aren't the masters of data interoperability, what are we?
> The main question is: Is the Semantic Web an evolutionary improvement of the Web, the Web understood as an ecosystem comprising protocols, data models, people, and economics - or is it a tiny special interest branch.
> As said: I bet a bottle of champagne that the academic Semantic Web community's technical proposals will never gain more than 10 % market share among "real" site-owners, because of

I worked for AltaVista and Sun Microsystems, so I am not an academic. And it would be difficult to get back to academia, as salaries are so low there. So we should be thankful for how much good work these people are putting into this for love of the subject.

> - unnecessary complexity (think of the simplicity of publishing an HTML page vs. following LOD publishing principles),

Well, data manipulation is of course more difficult than simple web pages. But there are large benefits to be gained from more structured data. As for the academic/business nonsense, you should look at how much IBM and co. put into SOAP, and where that got them: pretty much nowhere. The semantic web seems a lot more fruitful than SOAP to me, and has a lot more potential. It is not that difficult; it's just that people, en masse, are slow learners. But you know, there is time.

> - bad design decisions (e.g explicit datatyping of data instances in RDFa),
> - poor documentation for non-geeks, and
> - a lack of understanding of the economics of technology diffusion.

Technology diffuses a lot slower than people think. But in aggregate it diffuses a lot faster than we can cope with.
  - "history of technology adoption" http://bblfish.net/blog/page1.html#14
  - "was moore's law inevitable" http://www.kk.org/thetechnium/archives/2009/07/was_moores_law.php

In any case WebID is so mindbogglingly simple, it falsifies all the points above. You have a problem and there is a solution to it. Of course we need to stop bad crawling. But we should also start showing how the web can protect itself, without relying on good will alone.


> Never ever.
> Best
> Martin
> On Jun 22, 2011, at 3:18 PM, glenn mcdonald wrote:
>>> From my perspective as the designer of a system that both consumes and publishes data, the load/burden issue here is not at all particular to the semantic web. Needle obeys robots.txt rules, but that's a small deal compared to the difficulty of extracting whole data from sites set up to deliver it only in tiny pieces. I'd say about 98% of the time I can describe the data I want from a site with a single conceptual query. Indeed, once I've got the data into Needle I can almost always actually produce that query. But on the source site, I usually can't, and thus we are forced to waste everybody's time navigating the machines through superfluous presentation rendering designed for people. 10-at-a-time results lists, interminable AJAX refreshes, animated DIV reveals, grafting back together the splintered bits of tree-traversals, etc. This is all absurdly unnecessary. Why is anybody having to "crawl" an open semantic-web dataset? Isn't there a "download" link, and/or a SPARQL endpoint? If there isn't, why not? We're the Semantic Web, dammit. If we aren't the masters of data interoperability, what are we?
>> glenn
>> (www.needlebase.com)

Social Web Architect
Received on Wednesday, 22 June 2011 20:17:50 UTC
