- From: <ahogan@dcc.uchile.cl>
- Date: Fri, 25 Jul 2014 15:12:48 -0400
- To: public-lod@w3.org
On 25/07/2014 14:44, Hugh Glaser wrote:
> The idea that having a robots.txt that Disallows spiders
> is a “problem” for a dataset is rather bizarre.
> It is of course a problem for the spider, but is clearly not a problem for a
> typical consumer of the dataset.
> By that measure, serious numbers of the web sites we all use on a daily
> basis are problematic.

<snip>

I think the general interpretation of the robots in "robots.txt" is any software agent accessing the site "automatically" (versus a user manually entering a URL).

If we agree on that interpretation, a robots.txt blacklist prevents applications from following links to your site. In that case, my counter-question would be: what is the benefit of publishing your content as Linked Data (with dereferenceable URIs and rich links) if you subsequently prevent machines from discovering and accessing it automatically? Essentially you are requesting that humans (somehow) have to manually enter every URI/URL for every source, which is precisely the document-centric view we're trying to get away from.

Put simply, as far as I can see, a dereferenceable URI behind a robots.txt blacklist is no longer a dereferenceable URI ... at least for a respectful software agent. Linked Data behind a robots.txt blacklist is no longer Linked Data.

(This is quite clear in my mind but perhaps others might disagree.)

Best,
Aidan
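
P.S. To make "respectful software agent" a little more concrete, here is a rough sketch of what such an agent might do before dereferencing a URI, using Python's standard urllib.robotparser. The URI and agent name below are made up purely for illustration:

# Minimal sketch of a "respectful" agent: before dereferencing a URI,
# it checks the site's robots.txt and gives up if access is disallowed.
# The URI and agent name are hypothetical placeholders.
from urllib import robotparser, request
from urllib.parse import urlsplit

USER_AGENT = "my-ld-agent"                      # hypothetical agent name
uri = "http://example.org/resource/Thing42"     # hypothetical Linked Data URI

# Fetch and parse robots.txt for the host in question.
parts = urlsplit(uri)
rp = robotparser.RobotFileParser()
rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
rp.read()

if rp.can_fetch(USER_AGENT, uri):
    # Allowed: dereference the URI, asking for RDF.
    req = request.Request(uri, headers={"Accept": "text/turtle",
                                        "User-Agent": USER_AGENT})
    with request.urlopen(req) as resp:
        data = resp.read()
    print(f"Dereferenced {uri}: {len(data)} bytes")
else:
    # Disallowed: for this agent the URI is, in effect, not dereferenceable.
    print(f"robots.txt disallows {uri}; skipping")

If can_fetch() returns False, the agent never issues the request at all, which is exactly the sense in which the URI stops being dereferenceable for it.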
Received on Friday, 25 July 2014 19:13:12 UTC