From: <ahogan@dcc.uchile.cl>
Date: Fri, 25 Jul 2014 17:13:51 -0400
To: public-lod@w3.org
On 25/07/2014 15:54, Hugh Glaser wrote:
> Very interesting.
> On 25 Jul 2014, at 20:12, ahogan@dcc.uchile.cl wrote:
>
>> On 25/07/2014 14:44, Hugh Glaser wrote:
>>> The idea that having a robots.txt that Disallows spiders
>>> is a “problem” for a dataset is rather bizarre.
>>> It is of course a problem for the spider, but is clearly not a problem
>>> for a typical consumer of the dataset.
>>> By that measure, serious numbers of the web sites we all use on a daily
>>> basis are problematic.
>> <snip>
>>
>> I think the general interpretation of the robots in "robots.txt" is any
>> software agent accessing the site "automatically" (versus a user manually
>> entering a URL).
> I had never thought this.
> My understanding of the agents that should respect the robots.txt is what
> are usually called crawlers or spiders.
> Primarily search engines, but also including things that aim to
> automatically get a whole junk of a site.
> Of course, there is no de jure standard, but the places I look seem to
> lean to my view.
> http://www.robotstxt.org/orig.html
> "WWW Robots (also called wanderers or spiders) are programs that traverse
> many pages in the World Wide Web by recursively retrieving linked pages."
> https://en.wikipedia.org/wiki/Web_robot
> "Typically, bots perform tasks that are both simple and structurally
> repetitive, at a much higher rate than would be possible for a human
> alone."
> It’s all about scale and query rate.
> So a php script that fetches one URI now and then is not the target for
> the restriction - nor indeed is my shell script that daily fetches a
> common page I want to save on my laptop.
>
> So, I confess, when my system trips over a dbpedia (or any other) URI and
> does follow-your-nose to get the RDF, it doesn’t check that the site
> robots.txt allows it.
> And I certainly don’t expect Linked Data consumers doing simple URI
> resolution to check my robots.txt
>
> But you are right, if I am wrong - robots.txt would make no sense in the
> Linked Data world, since pretty much by definition it will always be an
> agent doing the access.
> But then I think we really need a convention (User-agent: ?) that lets me
> tell search engines to stay away, while allowing LD apps to access the
> stuff they want.

Then it seems our core disagreement is on the notion of a robot, which is indeed a grey area.

With respect to robots only referring to warehouses/search engines, this was indeed the primary use-case for robots.txt, but for me it's just an instance of what robots.txt is used for.

Rather than focus on what is a robot, I think it's important to look at (some of the commonly quoted reasons) why people use robots.txt and what the robots.txt requests:

"Charles Stross claims to have provoked Koster to suggest robots.txt, after he wrote a badly-behaved web spider that caused an inadvertent denial of service attack on Koster's server." [1]

Note that robots.txt has an optional Crawl-delay primitive.

Other reasons:

"A robots.txt file on a website will function as a request that specified robots ignore specified files or directories when crawling a site. This might be, for example, out of a preference for privacy from search engine results, or the belief that the content of the selected directories might be misleading or irrelevant to the categorization of the site as a whole, or out of a desire that an application only operate on certain data." [1]
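For concreteness, a minimal sketch of a robots.txt requesting both kinds of condition might look as follows (the paths, delay value and agent name are invented for illustration; by the convention, an agent obeys the most specific User-agent group that matches it):

  # ask every agent to throttle requests and to leave a local-only area alone
  User-agent: *
  Crawl-delay: 10
  Disallow: /private/

  # turn one named crawler away entirely
  User-agent: ExampleBot
  Disallow: /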
So, moving aside from the definition of a robot, more importantly, I think a domain administrator has very good grounds to be annoyed if any software agent (not just Google et al.) breaks the conditions requested in the robots.txt: conditions specifying local/sensitive data or request delays.

Likewise, I think that a lot of Linked Data clients are liable to break such conditions, especially those that follow links on the fly. Such LD clients can cause, for example, DoS attacks by requesting lots of pages in parallel in a very short time, or externalisation of site content that was intended to be kept local.

Aside from the crawlers of Linked Data warehouses, LD clients that could cause (D)DoS attacks include on-the-fly link traversal query execution engines, or navigational systems, or browsers that gather additional background information on-the-fly (like labels) from various surrounding documents. Other Linked Data clients may surface (blacklisted) data, intended to be kept local, on other sites. This would include LD clients that present blacklisted data from the site in question through an interface on another site, or that integrate said data with data from another site. For me, this captures pretty much all "non-trivial" Linked Data clients.

My reasoning is then as follows:

*) robots.txt should be respected by all software agents, not just warehouse crawlers, and not just Google et al.;

*) by their nature, robots.txt restrictions are relevant to most Linked Data clients, and a (responsible) Linked Data client should not breach the stated conditions (see the sketch below);

*) a robots.txt blacklist will completely stop (responsible) Linked Data clients from being able to dereference URIs;

*) a Linked Dataset behind a robots.txt blacklist is not a Linked Dataset.

(There are maybe some shades of grey in there that I have not properly represented, but I think the arguments hold for the most part while avoiding the ambiguous question of "what is a robot?".)

Best,
Aidan

[1] http://en.wikipedia.org/wiki/Robots_exclusion_standard
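P.S. To make the "responsible client" point concrete, here is a minimal sketch in Python (3.6+, standard library only) of the kind of check I mean: consult the site's robots.txt before dereferencing, and honour any Disallow and Crawl-delay conditions. The agent name and Accept header are invented for illustration, and a real client would of course cache robots.txt per site rather than refetch it for every URI.

  import time
  import urllib.robotparser
  from urllib.parse import urlparse
  from urllib.request import Request, urlopen

  AGENT = "ExampleLDClient/0.1"  # hypothetical client name

  def dereference(uri):
      # Fetch and parse the site's robots.txt before touching the URI itself.
      site = "{0.scheme}://{0.netloc}".format(urlparse(uri))
      rp = urllib.robotparser.RobotFileParser(site + "/robots.txt")
      rp.read()
      # A blacklisted URI is simply off limits to a responsible client.
      if not rp.can_fetch(AGENT, uri):
          return None
      # Honour a requested Crawl-delay, if any, before dereferencing.
      delay = rp.crawl_delay(AGENT)
      if delay:
          time.sleep(delay)
      req = Request(uri, headers={"Accept": "text/turtle, application/rdf+xml",
                                  "User-Agent": AGENT})
      return urlopen(req).read()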
Received on Friday, 25 July 2014 21:14:17 UTC