- From: Luca Matteis <lmatteis@gmail.com>
- Date: Sat, 26 Jul 2014 00:23:49 +0200
- To: Hugh Glaser <hugh@glasers.org>
- Cc: ahogan@dcc.uchile.cl, Linked Data community <public-lod@w3.org>
Robots.txt, to me, works well for a web of documents. That is, when you want only humans to access certain resources. But for a web of data, why resort to a robots.txt when you could simply not put the resource online in the first place?

On Fri, Jul 25, 2014 at 11:54 PM, Hugh Glaser <hugh@glasers.org> wrote:
> Hi,
> Well, as you might guess, I can’t say I agree.
> Firstly, as you correctly say, if there is a robots.txt with Disallow / on the RDF on a LD site, then it effectively prohibits any LD app from accessing the LD.
> So clearly that can’t be what the publisher intended (the idea of publishing RDF for humans to fetch is not a big market).
> So what did the publisher intend? This should be what the consumer aims to comply with.
> If you take a pragmatic (rather than perhaps more literal) view of what someone might mean when they put such a robots.txt on a LD site, then it can only mean "please only access my site in the sort of usage patterns that I might expect from a person” or similar.
>
> Secondly, I think in discussing robots, it is central to the issue to try to answer the question of "what is a robot?”, which is why I included that discussion, which is linked off the reference to robots on the Wikipedia page that you quote, rather than just the page you quote.
> The systems you describe are good questions, and I would say that in the end the builders have to decide whether their system is what the publisher might have thought of as a robot.
> My system (if I recall correctly!) monitors what it is accessing to ensure that it does not make undue demands on the LD sites it accesses; this is just good practice, irrespective of whether there is a Disallow or not, I think.
>
> I am guessing we will just have to differ on all this!
>
> Best
> Hugh
>
> On 25 Jul 2014, at 22:13, ahogan@dcc.uchile.cl wrote:
>
>> On 25/07/2014 15:54, Hugh Glaser wrote:
>>> Very interesting.
>>> On 25 Jul 2014, at 20:12, ahogan@dcc.uchile.cl wrote:
>>>
>>>> On 25/07/2014 14:44, Hugh Glaser wrote:
>>>>> The idea that having a robots.txt that Disallows spiders
>>>>> is a “problem” for a dataset is rather bizarre.
>>>>> It is of course a problem for the spider, but is clearly not a problem
>>>>> for a typical consumer of the dataset.
>>>>> By that measure, serious numbers of the web sites we all use on a daily
>>>>> basis are problematic.
>>>> <snip>
>>>>
>>>> I think the general interpretation of the robots in "robots.txt" is any
>>>> software agent accessing the site "automatically" (versus a user manually
>>>> entering a URL).
>>> I had never thought this.
>>> My understanding of the agents that should respect the robots.txt is what
>>> are usually called crawlers or spiders.
>>> Primarily search engines, but also including things that aim to
>>> automatically get a whole chunk of a site.
>>> Of course, there is no de jure standard, but the places I look seem to
>>> lean to my view.
>>> http://www.robotstxt.org/orig.html
>>> "WWW Robots (also called wanderers or spiders) are programs that traverse
>>> many pages in the World Wide Web by recursively retrieving linked pages.”
>>> https://en.wikipedia.org/wiki/Web_robot
>>> "Typically, bots perform tasks that are both simple and structurally
>>> repetitive, at a much higher rate than would be possible for a human
>>> alone.”
>>> It’s all about scale and query rate.
>>> So a PHP script that fetches one URI now and then is not the target for
>>> the restriction - nor indeed is my shell script that daily fetches a
>>> common page I want to save on my laptop.
>>>
>>> So, I confess, when my system trips over a dbpedia (or any other) URI and
>>> does follow-your-nose to get the RDF, it doesn’t check that the site
>>> robots.txt allows it.
>>> And I certainly don’t expect Linked Data consumers doing simple URI
>>> resolution to check my robots.txt.
>>>
>>> But you are right, if I am wrong - robots.txt would make no sense in the
>>> Linked Data world, since pretty much by definition it will always be an
>>> agent doing the access.
>>> But then I think we really need a convention (User-agent: ?) that lets me
>>> tell search engines to stay away, while allowing LD apps to access the
>>> stuff they want.
>>
>> Then it seems our core disagreement is on the notion of a robot, which is
>> indeed a grey area. With respect to robots only referring to
>> warehouses/search engines, this was indeed the primary use case for
>> robots.txt, but for me it is just one instance of what robots.txt is used
>> for.
>>
>> Rather than focus on what is a robot, I think it's important to look at
>> (some of the commonly quoted reasons) why people use robots.txt and what
>> the robots.txt requests:
>>
>> "Charles Stross claims to have provoked Koster to suggest robots.txt,
>> after he wrote a badly-behaved web spider that caused an inadvertent
>> denial of service attack on Koster's server." [1]
>>
>> Note that robots.txt has an optional Crawl-delay primitive. Other reasons:
>>
>> "A robots.txt file on a website will function as a request that specified
>> robots ignore specified files or directories when crawling a site. This
>> might be, for example, out of a preference for privacy from search engine
>> results, or the belief that the content of the selected directories might
>> be misleading or irrelevant to the categorization of the site as a whole,
>> or out of a desire that an application only operate on certain data." [1]
>>
>> So, moving aside from the definition of a robot, more importantly, I think
>> a domain administrator has very good grounds to be annoyed if any software
>> agent (not just Google et al.) breaks the conditions requested in the
>> robots.txt: conditions specifying local/sensitive data or request delays.
>>
>> Likewise, I think that a lot of Linked Data clients are liable to break
>> such conditions, especially those that follow links on the fly. Such LD
>> clients can cause, for example, DoS attacks by requesting lots of pages in
>> parallel in a very short time, or externalisation of site content that was
>> intended to be kept local.
>>
>> Aside from the crawlers of Linked Data warehouses, LD clients that could
>> cause (D)DoS attacks include on-the-fly link-traversal query execution
>> engines, navigational systems, or browsers that gather additional
>> background information on the fly (like labels) from various surrounding
>> documents.
>>
>> Other Linked Data clients may surface (blacklisted) data intended to be
>> kept local on other sites. This would include LD clients that present
>> blacklisted data from the site in question through an interface on another
>> site, or that integrate said data with data from another site.
>>
>> For me, this captures pretty much all "non-trivial" Linked Data clients.
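As a rough sketch of the sort of convention Hugh asks about, a publisher can already combine per-agent sections with the (non-standard, but widely recognised) Allow and Crawl-delay directives; the agent token "ExampleLDClient" below is purely hypothetical, and whether Linked Data clients would honour such a section is exactly the open question in this thread:

    # Hypothetical robots.txt: keep generic crawlers away from the RDF,
    # but let a (made-up) Linked Data agent in, at a polite request rate.
    User-agent: *
    Disallow: /data/

    User-agent: ExampleLDClient
    Allow: /data/
    Crawl-delay: 10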
>>
>> My reasoning is then as follows:
>>
>> *) robots.txt should be respected by all software agents, not just
>> warehouse crawlers, and not just Google et al;
>> *) by their nature, robots.txt restrictions are relevant to most Linked
>> Data clients and a (responsible) Linked Data client should not breach the
>> stated conditions;
>> *) a robots.txt blacklist will completely stop (responsible) Linked Data
>> clients from being able to dereference URIs;
>> *) a Linked Dataset behind a robots.txt blacklist is not a Linked Dataset.
>>
>> (There's maybe some shades of grey in there that I have not properly
>> represented, but I think the arguments hold for the most part while
>> avoiding the ambiguous question of "what is a robot?".)
>>
>> Best,
>> Aidan
>>
>> [1] http://en.wikipedia.org/wiki/Robots_exclusion_standard
>
> --
> Hugh Glaser
> 20 Portchester Rise
> Eastleigh
> SO50 4QS
> Mobile: +44 75 9533 4155, Home: +44 23 8061 5652
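To make the "responsible Linked Data client" in Aidan's reasoning concrete, here is a minimal Python sketch (not any particular client's implementation; the user-agent string, Accept header and one-second default delay are assumptions) that consults robots.txt and honours any Crawl-delay before dereferencing a URI:

    # Minimal sketch of a "polite" dereference: check robots.txt first and
    # honour any Crawl-delay. Uses only the Python standard library.
    import time
    import urllib.robotparser
    from urllib.parse import urlparse
    from urllib.request import Request, urlopen

    USER_AGENT = "ExampleLDClient/0.1"   # hypothetical agent token
    _last_fetch = {}                     # per-host time of the previous request

    def polite_dereference(uri):
        parts = urlparse(uri)
        robots_url = "{}://{}/robots.txt".format(parts.scheme, parts.netloc)

        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(robots_url)
        rp.read()                        # fetch and parse the site's robots.txt

        if not rp.can_fetch(USER_AGENT, uri):
            return None                  # the publisher asked us not to fetch this

        # Respect Crawl-delay (assume a modest 1-second gap if none is given).
        delay = rp.crawl_delay(USER_AGENT) or 1
        wait = delay - (time.time() - _last_fetch.get(parts.netloc, 0))
        if wait > 0:
            time.sleep(wait)
        _last_fetch[parts.netloc] = time.time()

        req = Request(uri, headers={"Accept": "text/turtle, application/rdf+xml",
                                    "User-Agent": USER_AGENT})
        with urlopen(req) as resp:
            return resp.read()

A real client would presumably also cache the parsed robots.txt per host rather than re-fetching it for every URI it dereferences.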
Received on Friday, 25 July 2014 22:24:17 UTC