Re: Think before you write Semantic Web crawlers

From my perspective as the designer of a system that both consumes and
publishes data, the load/burden issue here is not at all particular to the
semantic web. Needle obeys robots.txt rules, but that's a minor matter
compared to the difficulty of extracting complete data from sites set up to
deliver it only in tiny pieces. I'd say about 98% of the time I can describe
the data I want from a site with a single conceptual query. Indeed, once
I've got the data into Needle I can almost always express that query
directly. But on the source site I usually can't, so we are all forced to
waste everybody's time steering machines through superfluous presentation
rendering designed for people: 10-at-a-time result lists,
interminable AJAX refreshes, animated DIV reveals, grafting back together
the splintered bits of tree-traversals, etc. This is all absurdly
unnecessary. Why is anybody having to "crawl" an open semantic-web dataset?
Isn't there a "download" link, and/or a SPARQL endpoint? If there isn't, why
not? We're the Semantic Web, dammit. If we aren't the masters of data
interoperability, what are we?
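
To make the contrast concrete, here's a minimal sketch of what the
one-query alternative looks like, using only Python's standard library
against a public SPARQL endpoint. The endpoint and query are purely
illustrative (DBpedia, large cities by population); nothing here is
Needle-specific:

    import json
    import urllib.parse
    import urllib.request

    # One conceptual query, one HTTP request: no paging, no AJAX
    # refreshes, no screen-scraping. Endpoint and query are
    # illustrative examples only.
    ENDPOINT = "https://dbpedia.org/sparql"
    QUERY = """
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?city ?population WHERE {
      ?city a dbo:City ;
            dbo:populationTotal ?population .
      FILTER (?population > 5000000)
    }
    LIMIT 100
    """

    url = ENDPOINT + "?" + urllib.parse.urlencode({
        "query": QUERY,
        "format": "application/sparql-results+json",
    })
    with urllib.request.urlopen(url) as resp:
        results = json.load(resp)

    # Standard SPARQL JSON results: one binding per matching row.
    for b in results["results"]["bindings"]:
        print(b["city"]["value"], b["population"]["value"])

That's the entire "crawl". Any site whose data can be described this way
could publish it this way.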

glenn
(www.needlebase.com)

Received on Wednesday, 22 June 2011 13:19:18 UTC