Just what *does* robots.txt mean for a LOD site?

Hi.

I’m pretty sure this discussion suggests that we (the LD community) should try to come to some consensus on policy about exactly what it means if an agent finds a robots.txt on a Linked Data site.

So I have changed the subject line - sorry Chris, it should have been changed earlier.

Not an easy thing to come to, I suspect, but it seems to have become significant.
Is there a more official forum for this sort of thing?

On 26 Jul 2014, at 00:55, Luca Matteis <lmatteis@gmail.com> wrote:

> On Sat, Jul 26, 2014 at 1:34 AM, Hugh Glaser <hugh@glasers.org> wrote:
>> That sort of sums up what I want.
> 
> Indeed. So I agree that robots.txt should probably not establish
> whether something is a linked dataset or not. To me your data is still
> linked data even though robots.txt is blocking access of specific
> types of agents, such as crawlers.
> 
> Aidan,
> 
>> *) a Linked Dataset behind a robots.txt blacklist is not a Linked Dataset.
> 
> Isn't that a bit harsh? That would be the case if the only type of
> agent is a crawler. But as Hugh mentioned, linked datasets can be
> useful simply by treating URIs as dereferenceable identifiers without
> following links.
In Aidan’s view (I hope I am right here), it is perfectly sensible.
If you start from the premise that robots.txt is intended to prohibit access by anything other than a browser with a human at it, then only humans could fetch the RDF documents.
Which means that the RDF document is completely useless as machine-interpretable semantics for the resource, since it would need a human to do some cut and paste or something to get it into a processor.

It isn’t really a question of harsh - it is perfectly logical from that view of robots.txt (which isn’t our view, because we think that robots.txt is about “specific types of agents”, as you say).
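
As a rough illustration of the “specific types of agents” reading, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt contents, the agent names ExampleCrawler and ExampleResolver, and the example.org URI are all made up for illustration; this is not a proposal for what the policy should be.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary Linked Data site:
# one named crawler is blocked everywhere, every other agent
# is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleCrawler
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Hypothetical dereferenceable identifier on the site.
URI = "http://example.org/id/thing1"

# The same URI gets a different answer depending on who is asking.
crawler_allowed = parser.can_fetch("ExampleCrawler", URI)
resolver_allowed = parser.can_fetch("ExampleResolver", URI)

print("ExampleCrawler may fetch:", crawler_allowed)
print("ExampleResolver may fetch:", resolver_allowed)
```

Under these assumed rules the crawler is refused, while an agent that simply dereferences the URI is not, so the RDF stays usable by machines even with a robots.txt in place.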

Cheers
Hugh

-- 
Hugh Glaser
   20 Portchester Rise
   Eastleigh
   SO50 4QS
Mobile: +44 75 9533 4155, Home: +44 23 8061 5652

Received on Saturday, 26 July 2014 11:18:06 UTC