- From: Luca Matteis <lmatteis@gmail.com>
- Date: Sat, 26 Jul 2014 00:23:49 +0200
- To: Hugh Glaser <hugh@glasers.org>
- Cc: ahogan@dcc.uchile.cl, Linked Data community <public-lod@w3.org>
Robots.txt, to me, works well for a web of documents. That is, when you want only humans to access certain resources. But for a web of data, why resort to a robots.txt when you could simply not put the resource online in the first place?

On Fri, Jul 25, 2014 at 11:54 PM, Hugh Glaser <hugh@glasers.org> wrote:
> Hi,
> Well, as you might guess, I can’t say I agree.
> Firstly, as you correctly say, if there is a robots.txt with Disallow / on the RDF on a LD site, then it effectively prohibits any LD app from accessing the LD.
> So clearly that can’t be what the publisher intended (the idea of publishing RDF for humans to fetch is not a big market).
> So what did the publisher intend? This should be what the consumer aims to comply with.
> If you take a pragmatic (rather than perhaps more literal) view of what someone might mean when they put such a robots.txt on a LD site, then it can only mean "please only access my site in the sort of usage patterns that I might expect from a person” or similar.
>
> Secondly, I think in discussing robots, it is central to the issue to try to answer the question of "what is a robot?”, which is why I included that discussion, which is linked off the reference to robots on the Wikipedia page that you quote, rather than just the page you quote.
> The systems you describe are good questions, and I would say that in the end the builders have to decide whether their system is what the publisher might have thought of as a robot.
> My system (if I recall correctly!) monitors what it is accessing to ensure that it does not make undue demands on the LD sites it accesses; this is just good practice, irrespective of whether there is a Disallow or not, I think.
>
> I am guessing we will just have to differ on all this!
>
> Best
> Hugh
>
> On 25 Jul 2014, at 22:13, ahogan@dcc.uchile.cl wrote:
>
>> On 25/07/2014 15:54, Hugh Glaser wrote:
>>> Very interesting.
>>> On 25 Jul 2014, at 20:12, ahogan@dcc.uchile.cl wrote:
>>>
>>>> On 25/07/2014 14:44, Hugh Glaser wrote:
>>>>> The idea that having a robots.txt that Disallows spiders
>>>>> is a “problem” for a dataset is rather bizarre.
>>>>> It is of course a problem for the spider, but is clearly not a problem
>>>>> for a typical consumer of the dataset.
>>>>> By that measure, serious numbers of the web sites we all use on a daily
>>>>> basis are problematic.
>>>> <snip>
>>>>
>>>> I think the general interpretation of the robots in "robots.txt" is any
>>>> software agent accessing the site "automatically" (versus a user manually
>>>> entering a URL).
>>> I had never thought this.
>>> My understanding of the agents that should respect the robots.txt is what
>>> are usually called crawlers or spiders.
>>> Primarily search engines, but also including things that aim to
>>> automatically get a whole chunk of a site.
>>> Of course, there is no de jure standard, but the places I look seem to
>>> lean to my view.
>>> http://www.robotstxt.org/orig.html
>>> "WWW Robots (also called wanderers or spiders) are programs that traverse
>>> many pages in the World Wide Web by recursively retrieving linked pages.”
>>> https://en.wikipedia.org/wiki/Web_robot
>>> "Typically, bots perform tasks that are both simple and structurally
>>> repetitive, at a much higher rate than would be possible for a human
>>> alone.”
>>> It’s all about scale and query rate.
>>> So a PHP script that fetches one URI now and then is not the target for
>>> the restriction - nor indeed is my shell script that daily fetches a
>>> common page I want to save on my laptop.
>>>
>>> So, I confess, when my system trips over a dbpedia (or any other) URI and
>>> does follow-your-nose to get the RDF, it doesn’t check that the site
>>> robots.txt allows it.
>>> And I certainly don’t expect Linked Data consumers doing simple URI
>>> resolution to check my robots.txt.
>>>
>>> But you are right, if I am wrong - robots.txt would make no sense in the
>>> Linked Data world, since pretty much by definition it will always be an
>>> agent doing the access.
>>> But then I think we really need a convention (User-agent: ?) that lets me
>>> tell search engines to stay away, while allowing LD apps to access the
>>> stuff they want.
>>
>> Then it seems our core disagreement is on the notion of a robot, which is
>> indeed a grey area. With respect to robots only referring to
>> warehouses/search engines, this was indeed the primary use case for
>> robots.txt, but for me it is just one instance of what robots.txt is used
>> for.
>>
>> Rather than focus on what is a robot, I think it's important to look at
>> (some of the commonly quoted reasons) why people use robots.txt and what
>> the robots.txt requests:
>>
>> "Charles Stross claims to have provoked Koster to suggest robots.txt,
>> after he wrote a badly-behaved web spider that caused an inadvertent
>> denial of service attack on Koster's server." [1]
>>
>> Note that robots.txt has an optional Crawl-delay primitive. Other reasons:
>>
>> "A robots.txt file on a website will function as a request that specified
>> robots ignore specified files or directories when crawling a site. This
>> might be, for example, out of a preference for privacy from search engine
>> results, or the belief that the content of the selected directories might
>> be misleading or irrelevant to the categorization of the site as a whole,
>> or out of a desire that an application only operate on certain data." [1]
>>
>> So, moving aside from the definition of a robot, more importantly, I think
>> a domain administrator has very good grounds to be annoyed if any software
>> agent (not just Google et al.) breaks the conditions requested in the
>> robots.txt: conditions specifying local/sensitive data or request delays.
>>
>> Likewise, I think that a lot of Linked Data clients are liable to break
>> such conditions, especially those that follow links on the fly. Such LD
>> clients can cause, for example, DoS attacks by requesting lots of pages in
>> parallel in a very short time, or externalisation of site content that was
>> intended to be kept local.
>>
>> Aside from the crawlers of Linked Data warehouses, LD clients that could
>> cause (D)DoS attacks include on-the-fly link-traversal query execution
>> engines, navigational systems, or browsers that gather additional
>> background information on the fly (like labels) from various surrounding
>> documents.
>>
>> Other Linked Data clients may surface (blacklisted) data intended to be
>> kept local on other sites. This would include LD clients that present
>> blacklisted data from the site in question through an interface on another
>> site, or that integrate said data with data from another site.
>>
>> For me, this captures pretty much all "non-trivial" Linked Data clients.
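As a rough sketch of the sort of convention Hugh asks about, a publisher can already combine per-agent sections with the (non-standard, but widely recognised) Allow and Crawl-delay directives; the agent token "ExampleLDClient" below is purely hypothetical, and whether Linked Data clients would honour such a section is exactly the open question in this thread:

    # Hypothetical robots.txt: keep generic crawlers away from the RDF,
    # but let a (made-up) Linked Data agent in, at a polite request rate.
    User-agent: *
    Disallow: /data/

    User-agent: ExampleLDClient
    Allow: /data/
    Crawl-delay: 10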
>>
>> My reasoning is then as follows:
>>
>> *) robots.txt should be respected by all software agents, not just
>> warehouse crawlers, and not just Google et al;
>> *) by their nature, robots.txt restrictions are relevant to most Linked
>> Data clients and a (responsible) Linked Data client should not breach the
>> stated conditions;
>> *) a robots.txt blacklist will completely stop (responsible) Linked Data
>> clients from being able to dereference URIs;
>> *) a Linked Dataset behind a robots.txt blacklist is not a Linked Dataset.
>>
>> (There's maybe some shades of grey in there that I have not properly
>> represented, but I think the arguments hold for the most part while
>> avoiding the ambiguous question of "what is a robot?".)
>>
>> Best,
>> Aidan
>>
>> [1] http://en.wikipedia.org/wiki/Robots_exclusion_standard
>
> --
> Hugh Glaser
> 20 Portchester Rise
> Eastleigh
> SO50 4QS
> Mobile: +44 75 9533 4155, Home: +44 23 8061 5652
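To make the "responsible Linked Data client" in Aidan's reasoning concrete, here is a minimal Python sketch (not any particular client's implementation; the user-agent string, Accept header and one-second default delay are assumptions) that consults robots.txt and honours any Crawl-delay before dereferencing a URI:

    # Minimal sketch of a "polite" dereference: check robots.txt first and
    # honour any Crawl-delay. Uses only the Python standard library.
    import time
    import urllib.robotparser
    from urllib.parse import urlparse
    from urllib.request import Request, urlopen

    USER_AGENT = "ExampleLDClient/0.1"   # hypothetical agent token
    _last_fetch = {}                     # per-host time of the previous request

    def polite_dereference(uri):
        parts = urlparse(uri)
        robots_url = "{}://{}/robots.txt".format(parts.scheme, parts.netloc)

        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(robots_url)
        rp.read()                        # fetch and parse the site's robots.txt

        if not rp.can_fetch(USER_AGENT, uri):
            return None                  # the publisher asked us not to fetch this

        # Respect Crawl-delay (assume a modest 1-second gap if none is given).
        delay = rp.crawl_delay(USER_AGENT) or 1
        wait = delay - (time.time() - _last_fetch.get(parts.netloc, 0))
        if wait > 0:
            time.sleep(wait)
        _last_fetch[parts.netloc] = time.time()

        req = Request(uri, headers={"Accept": "text/turtle, application/rdf+xml",
                                    "User-Agent": USER_AGENT})
        with urlopen(req) as resp:
            return resp.read()

A real client would presumably also cache the parsed robots.txt per host rather than re-fetching it for every URI it dereferences.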
Received on Friday, 25 July 2014 22:24:17 UTC