Re: Just what *does* robots.txt mean for a LOD site?

Thanks Hugh for the subject change and the reasonable summary.

@Luca, per my previous emails, I think that a robots.txt blacklist should
affect a broad range of Linked Data agents, so much so that I would no
longer consider the affected URIs dereferenceable, and thus I would no
longer call the affected data Linked Data. I don't feel that "harsh" is
applicable ... but I guess there is room for discussion. :)

The remaining difference of opinion is about the extent to which Linked
Data agents need to pay attention to the robots.txt file.

As many others have suggested, I buy into the idea that any agent not
relying on direct user input for each document it fetches should be
subject to robots.txt.
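To make that concrete, here is a minimal sketch of how such an agent could
consult robots.txt before dereferencing a URI. It uses Python's standard
urllib.robotparser; the agent name and the example.org URIs are purely
illustrative, not anyone's actual setup:

    import urllib.robotparser

    AGENT = "ExampleLDAgent/0.1"  # hypothetical user-agent string

    def may_dereference(uri, robots_url):
        """Return True if robots.txt allows this agent to fetch the URI."""
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(robots_url)
        rp.read()  # fetch and parse the site's robots.txt
        return rp.can_fetch(AGENT, uri)

    # e.g.:
    # may_dereference("http://example.org/resource/foo",
    #                 "http://example.org/robots.txt")

An agent driven document-by-document by a human (a browser, effectively)
would skip this check; anything that follows links on its own would not.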


I should add that in your case, Hugh, you can avoid problems by adopting
more fine-grained controls in your robots.txt file. For example, you can
specifically ban the Google/Yahoo!/Yandex/Bing agents, etc., from parts of
your site using robots.txt. Likewise, if you are concerned about the use
of resources, you can throttle agents using "Crawl-delay" (a non-standard
extension, but one that should be respected by the "big agents"). You can
set the crawl delay based on the cost you foresee per request and the
number of agents you expect to be competing for resources.
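As a rough sketch, such a robots.txt might look like the following (the
paths, agent names and delay value are only illustrative, not a
recommendation for your site):

    # Keep the big crawlers out of the expensive part of the site
    User-agent: Googlebot
    User-agent: Bingbot
    Disallow: /data/

    # Ask everyone else to pace themselves
    User-agent: *
    Crawl-delay: 10

For the delay value, a back-of-the-envelope calculation is enough: if a
request costs you roughly 2 seconds of server time and you expect on the
order of 5 agents competing, a Crawl-delay of 10 seconds keeps the load
at around one request's worth of work at any given time.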

Note also that even the big spiders like Google, Yahoo!, etc., are
unlikely to crawl very deep into your dataset unless you have a lot of
incoming links. Essentially, your site as you describe it sounds like part
of the "Deep Web".

Best,
Aidan

On 26/07/2014 07:16, Hugh Glaser wrote:
> Hi.
>
> I’m pretty sure this discussion suggests that we (the LD community)
> should try to come to some consensus on policy about exactly what it
> means if an agent finds a robots.txt on a Linked Data site.
>
> So I have changed the subject line - sorry Chris, it should have been
> changed earlier.
>
> Not an easy thing to come to, I suspect, but it seems to have become
> significant.
> Is there a more official forum for this sort of thing?
>
> On 26 Jul 2014, at 00:55, Luca Matteis <lmatteis@gmail.com> wrote:
>
>> On Sat, Jul 26, 2014 at 1:34 AM, Hugh Glaser <hugh@glasers.org> wrote:
>>> That sort of sums up what I want.
>>
>> Indeed. So I agree that robots.txt should probably not establish
>> whether something is a linked dataset or not. To me your data is still
>> linked data even though robots.txt is blocking access by specific
>> types of agents, such as crawlers.
>>
>> Aidan,
>>
>>> *) a Linked Dataset behind a robots.txt blacklist is not a Linked
>>> Dataset.
>>
>> Isn't that a bit harsh? That would be the case if the only type of
>> agent is a crawler. But as Hugh mentioned, linked datasets can be
>> useful simply by treating URIs as dereferenceable identifiers without
>> following links.
> In Aidan’s view (I hope I am right here), it is perfectly sensible.
> If you start from the premise that robots.txt is intended to prohibit
> access by anything other than a browser with a human at it, then only
> humans could fetch the RDF documents.
> Which means that the RDF document is completely useless as a machine-
> interpretable semantics for the resource, since it would need a human
> to do some cut and paste or something to get it into a processor.
>
> It isn’t really a question of harsh - it is perfectly logical from that
> view of robots.txt (which isn’t our view, because we think that robots.txt
> is about "specific types of agents”, as you say).
>
> Cheers
> Hugh
>
