Re: Updated LOD Cloud Diagram - Missed data sources.

On 25/07/2014 15:54, Hugh Glaser wrote:
> Very interesting.
> On 25 Jul 2014, at 20:12, ahogan@dcc.uchile.cl wrote:
>
>> On 25/07/2014 14:44, Hugh Glaser wrote:
>>> The idea that having a robots.txt that Disallows spiders
>>> is a “problem” for a dataset is rather bizarre.
>>> It is of course a problem for the spider, but is clearly not a problem
>>> for a typical consumer of the dataset.
>>> By that measure, serious numbers of the web sites we all use on a daily
>>> basis are problematic.
>> <snip>
>>
>> I think the general interpretation of the robots in "robots.txt" is any
>> software agent accessing the site "automatically" (versus a user manually
>> entering a URL).
> I had never thought this.
> My understanding of the agents that should respect the robots.txt is what
> are usually called crawlers or spiders.
> Primarily search engines, but also including things that aim to
> automatically get a whole chunk of a site.
> Of course, there is no de jure standard, but the places I look seem to
> lean to my view.
> http://www.robotstxt.org/orig.html
> "WWW Robots (also called wanderers or spiders) are programs that traverse
> many pages in the World Wide Web by recursively retrieving linked pages.”
> https://en.wikipedia.org/wiki/Web_robot
> "Typically, bots perform tasks that are both simple and structurally
> repetitive, at a much higher rate than would be possible for a human
> alone."
> It’s all about scale and query rate.
> So a php script that fetches one URI now and then is not the target for
> the restriction - nor indeed is my shell script that daily fetches a
> common page I want to save on my laptop.
>
> So, I confess, when my system trips over a dbpedia (or any other) URI and
> does follow-your-nose to get the RDF, it doesn’t check that the site
> robots.txt allows it.
> And I certainly don’t expect Linked Data consumers doing simple URI
> resolution to check my robots.txt
>
> But you are right, if I am wrong - robots.txt would make no sense in the
> Linked Data world, since pretty much by definition it will always be an
> agent doing the access.
> But then I think we really need a convention (User-agent: ?) that lets me
> tell search engines to stay away, while allowing LD apps to access the
> stuff they want.
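
(On that last point: in principle, existing robots.txt syntax could
already express such a convention if Linked Data clients agreed on a
shared user-agent token; the token below is purely hypothetical:

  User-agent: *
  Disallow: /

  User-agent: linked-data-client
  Disallow:          # an empty Disallow means "allow everything"

The catch is that no such shared token exists today, so a site would have
to whitelist clients one by one.)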

Then it seems our core disagreement is on the notion of a robot, which is
indeed a grey area. As for robots referring only to warehouses/search
engines: that was certainly the primary use-case for robots.txt, but for
me it's just one instance of what robots.txt is used for.

Rather than focus on what a robot is, I think it's important to look at
some of the commonly quoted reasons why people use robots.txt and what a
robots.txt requests:


"Charles Stross claims to have provoked Koster to suggest robots.txt,
after he wrote a badly-behaved web spider that caused an inadvertent
denial of service attack on Koster's server." [1]

Note that robots.txt has an optional Crawl-delay directive. Other reasons:

"A robots.txt file on a website will function as a request that specified
robots ignore specified files or directories when crawling a site. This
might be, for example, out of a preference for privacy from search engine
results, or the belief that the content of the selected directories might
be misleading or irrelevant to the categorization of the site as a whole,
or out of a desire that an application only operate on certain data." [1]


So, setting aside the definition of a robot, more importantly I think a
domain administrator has very good grounds to be annoyed if any software
agent (not just Google et al.) breaks the conditions requested in the
robots.txt, whether those conditions concern local/sensitive data or
request delays.
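
To make that concrete, a robots.txt along the following lines requests
both at once (the paths and the delay here are of course made up):

  User-agent: *
  Crawl-delay: 10
  Disallow: /internal/
  Disallow: /drafts/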

Likewise, I think that a lot of Linked Data clients are liable to break
such conditions, especially those that follow links on the fly. Such LD
clients can cause, for example, DoS attacks by requesting lots of pages in
parallel in a very short time, or externalisation of site content that was
intended to be kept local.

Aside from the crawlers of Linked Data warehouses, LD clients that could
cause (D)DoS attacks include link-traversal query execution engines,
navigational systems, or browsers that gather additional background
information on the fly (like labels) from various surrounding documents.
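
As a sketch of what a "responsible" client might do, the following
(Python 3) checks the site's robots.txt before each dereference and
honours any Crawl-delay; the user-agent token "example-ld-client" is made
up for illustration:

  import time
  import urllib.robotparser
  from urllib.parse import urljoin, urlparse
  from urllib.request import Request, urlopen

  AGENT = "example-ld-client"

  def polite_dereference(uri):
      # Fetch and parse the robots.txt for the URI's site.
      root = "{0.scheme}://{0.netloc}/".format(urlparse(uri))
      rp = urllib.robotparser.RobotFileParser()
      rp.set_url(urljoin(root, "robots.txt"))
      rp.read()
      # If robots are asked to stay away from this URI, don't fetch it.
      if not rp.can_fetch(AGENT, uri):
          return None
      # Honour any requested delay between requests (Python 3.6+).
      delay = rp.crawl_delay(AGENT)
      if delay:
          time.sleep(delay)
      req = Request(uri, headers={"Accept": "text/turtle, application/rdf+xml",
                                  "User-Agent": AGENT})
      return urlopen(req).read()

A link-traversal engine would additionally want to cache the parsed
robots.txt per host rather than re-fetching it for every URI.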

Other Linked Data clients may surface, on other sites, (blacklisted) data
that was intended to be kept local. This would include LD clients that
present blacklisted data from the site in question through an interface on
another site, or that integrate said data with data from another site.

For me, this captures pretty much all "non-trivial" Linked Data clients.


My reasoning is then as follows:

*) robots.txt should be respected by all software agents, not just
warehouse crawlers, and not just Google et al;
*) by their nature, robots.txt restrictions are relevant to most Linked
Data clients and a (responsible) Linked Data client should not breach the
stated conditions;
*) a robots.txt blacklist will completely stop (responsible) Linked Data
clients from being able to dereference URIs;
*) a Linked Dataset behind a robots.txt blacklist is not a Linked Dataset.

(There are maybe some shades of grey in there that I have not properly
represented, but I think the arguments hold for the most part while
avoiding the ambiguous question of "what is a robot?".)

Best,
Aidan

[1] http://en.wikipedia.org/wiki/Robots_exclusion_standard

Received on Friday, 25 July 2014 21:14:17 UTC