
Re: Think before you write Semantic Web crawlers

From: glenn mcdonald <glenn@furia.com>
Date: Wed, 22 Jun 2011 09:18:28 -0400
Message-ID: <BANLkTinGjRFOkd4Hufz9=-pPdYbO5gUoTg@mail.gmail.com>
To: Martin Hepp <martin.hepp@ebusiness-unibw.org>
Cc: Yves Raimond <yves.raimond@gmail.com>, Christopher Gutteridge <cjg@ecs.soton.ac.uk>, Daniel Herzig <herzig@kit.edu>, semantic-web@w3.org, public-lod@w3.org
From my perspective as the designer of a system that both consumes and
publishes data, the load/burden issue here is not at all particular to the
semantic web. Needle obeys robots.txt rules, but that's a small deal
compared to the difficulty of extracting whole data from sites set up to
deliver it only in tiny pieces. I'd say about 98% of the time I can describe
the data I want from a site with a single conceptual query. Indeed, once
I've got the data into Needle I can almost always actually produce that
query. But on the source site, I usually can't, and thus we are forced to
waste everybody's time navigating the machines through superfluous
presentation rendering designed for people. 10-at-a-time results lists,
interminable AJAX refreshes, animated DIV reveals, grafting back together
the splintered bits of tree-traversals, etc. This is all absurdly
unnecessary. Why is anybody having to "crawl" an open semantic-web dataset?
Isn't there a "download" link, and/or a SPARQL endpoint? If there isn't, why
not? We're the Semantic Web, dammit. If we aren't the masters of data
interoperability, what are we?
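
To make the contrast concrete, here is a rough sketch rather than anything Needle actually does. It assumes a hypothetical endpoint at example.org that follows the SPARQL Protocol convention of passing the query text in a `query` GET parameter; the endpoint URL, the query, and both helper function names are made up for illustration. It builds the single request that would ask for the whole dataset, and counts how many requests a 10-at-a-time results list needs for the same data.

```python
import math
from urllib.parse import urlencode, urlparse, parse_qs


def sparql_get_url(endpoint: str, query: str) -> str:
    """Build a SPARQL Protocol GET URL; the query goes in the `query` parameter."""
    return endpoint + "?" + urlencode({"query": query})


def paged_request_count(total_results: int, page_size: int = 10) -> int:
    """Number of HTTP requests needed to walk a paginated results list."""
    return math.ceil(total_results / page_size)


# Hypothetical endpoint and query, for illustration only.
ENDPOINT = "http://example.org/sparql"
QUERY = "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o }"

# One conceptual query, one request.
url = sparql_get_url(ENDPOINT, QUERY)

# The same data behind a 10-at-a-time results list.
requests_needed = paged_request_count(5000)

print(url)
print(requests_needed, "paged requests versus 1 query")
```

The numbers are only there to show the shape of the problem: the crawl cost grows with the size of the dataset, while the query (or a download link) stays at one request.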

Received on Wednesday, 22 June 2011 13:19:18 UTC
