Re: Updated LOD Cloud Diagram - Missed data sources.

Very interesting.
On 25 Jul 2014, at 20:12, ahogan@dcc.uchile.cl wrote:

> On 25/07/2014 14:44, Hugh Glaser wrote:
>> The idea that having a robots.txt that Disallows spiders
>> is a “problem” for a dataset is rather bizarre.
>> It is of course a problem for the spider, but is clearly not a problem for a
>> typical consumer of the dataset.
>> By that measure, serious numbers of the web sites we all use on a daily
>> basis are problematic.
> <snip>
> 
> I think the general interpretation of the robots in "robots.txt" is any
> software agent accessing the site "automatically" (versus a user manually
> entering a URL).
I had never thought of it that way.
My understanding is that the agents that should respect robots.txt are what are usually called crawlers or spiders.
Primarily search engines, but also anything that aims to automatically fetch a whole chunk of a site.
Of course, there is no de jure standard, but the places I look seem to lean to my view.
http://www.robotstxt.org/orig.html
"WWW Robots (also called wanderers or spiders) are programs that traverse many pages in the World Wide Web by recursively retrieving linked pages.”
https://en.wikipedia.org/wiki/Web_robot
"Typically, bots perform tasks that are both simple and structurally repetitive, at a much higher rate than would be possible for a human alone. “
It’s all about scale and query rate.
So a PHP script that fetches one URI now and then is not the target of the restriction, nor indeed is my shell script that daily fetches a common page I want to save on my laptop.

So, I confess, when my system trips over a DBpedia (or any other) URI and does follow-your-nose to get the RDF, it doesn’t check that the site’s robots.txt allows it.
And I certainly don’t expect Linked Data consumers doing simple URI resolution to check my robots.txt.
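For what it’s worth, here is a rough sketch of what such a check would look like if a Linked Data consumer did bother, using Python’s standard robotparser (the agent name "MyLDClient" is made up purely for illustration, not anything my system uses):

    # Sketch only: check a site's robots.txt before dereferencing a URI.
    # "MyLDClient" is a hypothetical agent name used for the example.
    from urllib import robotparser
    from urllib.parse import urlparse

    def may_dereference(uri, agent="MyLDClient"):
        parts = urlparse(uri)
        rp = robotparser.RobotFileParser()
        rp.set_url(parts.scheme + "://" + parts.netloc + "/robots.txt")
        rp.read()  # fetch and parse the site's robots.txt
        return rp.can_fetch(agent, uri)

    # e.g. may_dereference("http://dbpedia.org/resource/Eastleigh")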

But you are right that, if I am wrong, robots.txt would make no sense in the Linked Data world, since pretty much by definition it will always be a software agent doing the access.
But then I think we really need a convention (a reserved User-agent token, perhaps?) that lets me tell search engines to stay away while allowing LD apps to access the stuff they want.
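Something like the following, say, where "ld-client" is a purely made-up token for the sake of the example (no such convention exists, as far as I know):

    # Hypothetical robots.txt: bulk crawlers and search engines are asked to
    # stay away, while an agreed Linked Data client token may fetch anything.
    User-agent: ld-client
    Disallow:

    User-agent: *
    Disallow: /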

Best
Hugh
> 
> If we agree on that interpretation, a robots.txt blacklist prevents
> applications from following links to your site. In that case, my
> counter-question would be: what is the benefit of publishing your content
> as Linked Data (with dereferenceable URIs and rich links) if you
> subsequently prevent machines from discovering and accessing it
> automatically? Essentially you are requesting that humans (somehow) have
> to manually enter every URI/URL for every source, which is precisely the
> document-centric view we're trying to get away from.
> 
> Put simply, as far as I can see, a dereferenceable URI behind a robots.txt
> blacklist is no longer a dereferenceable URI ... at least for a respectful
> software agent. Linked Data behind a robots.txt blacklist is no longer
> Linked Data.
> 
> (This is quite clear in my mind but perhaps others might disagree.)
> 
> Best,
> Aidan
> 
> 

-- 
Hugh Glaser
   20 Portchester Rise
   Eastleigh
   SO50 4QS
Mobile: +44 75 9533 4155, Home: +44 23 8061 5652
