Re: Think before you write Semantic Web crawlers from Martin Hepp on 2011-06-23 (public-lod@w3.org from June 2011)

From: Martin Hepp <martin.hepp@ebusiness-unibw.org>
Date: Thu, 23 Jun 2011 10:09:25 +0200
To: Kingsley Idehen <kidehen@openlinksw.com>
Cc: public-lod@w3.org
Message-Id: <FAA337A2-B44B-48B3-A922-716A046A45F7@ebusiness-unibw.org>

Yes, WebID is without question a good thing. I am not entirely sure, though, that you can make it a mandatory requirement for access to your site, because if a few major consumers do not use WebID for their crawlers, site owners cannot block anonymous crawlers.

On Jun 22, 2011, at 9:10 PM, Kingsley Idehen wrote:

> On 6/22/11 8:05 PM, Martin Hepp wrote:
>> Glenn:
>> 
>>> If there isn't, why not? We're the Semantic Web, dammit. If we aren't the masters of data interoperability, what are we?
>> The main question is: Is the Semantic Web an evolutionary improvement of the Web, the Web understood as an ecosystem comprising protocols, data models, people, and economics - or is it a tiny special interest branch?
>> 
>> As said: I bet a bottle of champagne that the academic Semantic Web community's technical proposals will never gain more than 10 % market share among "real" site-owners, because of
>> - unnecessary complexity (think of the simplicity of publishing an HTML page vs. following LOD publishing principles),
>> - bad design decisions (e.g., explicit datatyping of data instances in RDFa),
>> - poor documentation for non-geeks, and
>> - a lack of understanding of the economics of technology diffusion.
> 
> Hoping you don't place WebID in the academic adventure bucket, right?
> 
> WebID, like URI abstraction, is well thought out critical infrastructure tech.
> 
> Kingsley
>> Never ever.
>> 
>> Best
>> 
>> Martin
>> 
>> On Jun 22, 2011, at 3:18 PM, glenn mcdonald wrote:
>> 
>>> From my perspective as the designer of a system that both consumes and publishes data, the load/burden issue here is not at all particular to the semantic web. Needle obeys robots.txt rules, but that's a small deal compared to the difficulty of extracting whole data from sites set up to deliver it only in tiny pieces. I'd say about 98% of the time I can describe the data I want from a site with a single conceptual query. Indeed, once I've got the data into Needle I can almost always actually produce that query. But on the source site, I usually can't, and thus we are forced to waste everybody's time navigating the machines through superfluous presentation rendering designed for people. 10-at-a-time results lists, interminable AJAX refreshes, animated DIV reveals, grafting back together the splintered bits of tree-traversals, etc. This is all absurdly unnecessary. Why is anybody having to "crawl" an open semantic-web dataset? Isn't there a "download" link, and/or a SPARQL endpoint? If there isn't, why not? We're the Semantic Web, dammit. If we aren't the masters of data interoperability, what are we?
>>> 
>>> glenn
>>> (www.needlebase.com)
>> 
>> 
> 
> 
> -- 
> 
> Regards,
> 
> Kingsley Idehen	
> President & CEO
> OpenLink Software
> Web: http://www.openlinksw.com
> Weblog: http://www.openlinksw.com/blog/~kidehen
> Twitter/Identi.ca: kidehen
> 

Received on Thursday, 23 June 2011 08:10:01 UTC