Re: Updated LOD Cloud Diagram - Missed data sources.

Hi Luca,
Thanks for asking.

I have datasets that run to hundreds of millions, and even billions, of resolvable URIs.
I even have datasets with effectively infinite numbers of URIs.
Some people seem to find them useful, in the sense that they want to look specific things up.
These are not static documents - they are RDF documents generated dynamically from SQL databases, triple stores or other storage mechanisms.
Allowing crawlers to try to spider large parts, or all, of the dataset can be a serious cost to me in terms of server processor, network and disk (I do some caching to trade processor cost against disk space).
Some of the documents can take several seconds of CPU to generate.
(Since all this is unfunded most costs come out of my pocket, by the way.)
So it may be that avoiding spiders is the difference between me offering the dataset and not - or at least it means that the service that the “real” users get is not overwhelmed by the bots.

So what I want to do is make the datasets available, but I don’t want to bear the costs of having Google, Bing, or anyone else, actually crawling the site.
And no, I don't want to require anything more than URI resolution - no registration or authentication - I want access to be as easy as possible.
Actually, spidering is what the sitemap is for (and I put work into building one whenever it is possible).
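Concretely, what I have in mind is something along these lines (a sketch only - the Sitemap URL is a placeholder for wherever a dump or semantic sitemap actually lives):

  User-agent: *
  Disallow: /
  Sitemap: http://example.org/sitemap.xml

That is: please don't spider the resolvable URIs one by one; if you want the bulk data, this is where to get it.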

Oh, and I should say that the dynamic nature of the data means that Last-Modified and similar headers cannot be reliably set, so bots would find incremental spidering rather challenging.
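To illustrate (a sketch, with a placeholder URI): an incremental crawler would normally revisit with a conditional request, but since each document is generated on the fly there is no stable Last-Modified or ETag to validate against, so it gets a full response every time:

  import urllib.request

  # Placeholder URI - not a real dataset resource.
  uri = "http://example.org/resource/12345"

  # An incremental crawler typically sends If-Modified-Since, hoping for a 304.
  req = urllib.request.Request(uri, headers={
      "Accept": "application/rdf+xml",
      "If-Modified-Since": "Fri, 25 Jul 2014 00:00:00 GMT",
  })
  with urllib.request.urlopen(req) as resp:
      # With dynamically generated documents there is nothing reliable to
      # validate against, so the server regenerates and answers 200 in full.
      print(resp.status, resp.headers.get("Last-Modified"))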

And I do think what I say applies to the web of documents.
Would a web site manager really object to me having a script that occasionally fetched some news or weather and displayed it on a web page?

By the way, I see that the standard Drupal instance puts this in the robots.txt:
# This file is to prevent the crawling and indexing of certain parts
# of your site by web crawlers and spiders run by sites like Yahoo!
# and Google. By telling these "robots" where not to go on your site,
# you save bandwidth and server resources.

That sort of sums up what I want.
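And checking that file from a client is cheap - roughly this, using the Python standard library (a sketch; the agent name and URIs are placeholders):

  import urllib.robotparser

  rp = urllib.robotparser.RobotFileParser()
  rp.set_url("http://example.org/robots.txt")
  rp.read()

  # A bulk crawler should stay away; whether a one-off Linked Data
  # dereference counts as a "robot" is exactly what we are debating.
  print(rp.can_fetch("ExampleBot/1.0", "http://example.org/resource/12345"))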

But now I seem to be repeating myself :-)

Best
Hugh

On 25 Jul 2014, at 23:23, Luca Matteis <lmatteis@gmail.com> wrote:

> Robots.txt to me works well for a web of documents. That is, wanting
> only humans to access certain resources. But for a web of data, why
> resort to a robots.txt when you could simply not put the resource
> online in the first place?
> 
> On Fri, Jul 25, 2014 at 11:54 PM, Hugh Glaser <hugh@glasers.org> wrote:
>> Hi,
>> Well, as you might guess, I can’t say I agree.
>> Firstly, as you correctly say, if there is a robots.txt with Disallow / on the RDF on a LD site, then it effectively prohibits any LD app from accessing the LD.
>> So clearly that can’t be what the publisher intended (the idea of publishing RDF for humans to fetch is not a big market).
>> So what did the publisher intend? This should be what the consumer aims to comply with.
>> If you take a pragmatic (rather than perhaps more literal) view of what someone might mean when they put such a robots.txt on a LD site, then it can only mean "please only access my site in the sort of usage patterns that I might expect from a person” or similar.
>> 
>> Secondly, I think in discussing robots, it is central to the issue to try to answer the question of "what is a robot?”, which is why I included that discussion, which is linked off reference to robots on the wikipedia page that you quote, rather than just the page you quote.
>> The systems you describe are good questions, and I would say that in the end the builders have to decide whether their system is what the publisher might have thought of as a robot.
>> My system (if I recall correctly!), monitors what it is accessing to ensure that it does not make undue demands on the LD sites it accesses; this is just good practice, irrespective of whether there is a Disallow or not, I think.
>> 
>> I am guessing we will just have to differ on all this!
>> 
>> Best
>> Hugh
>> 
>> On 25 Jul 2014, at 22:13, ahogan@dcc.uchile.cl wrote:
>> 
>>> 
>>> On 25/07/2014 15:54, Hugh Glaser wrote:
>>>> Very interesting.
>>>> On 25 Jul 2014, at 20:12, ahogan@dcc.uchile.cl wrote:
>>>> 
>>>>> On 25/07/2014 14:44, Hugh Glaser wrote:
>>>>>> The idea that having a robots.txt that Disallows spiders
>>>>>> is a “problem” for a dataset is rather bizarre.
>>>>>> It is of course a problem for the spider, but is clearly not a problem
>>>>> for a
>>>>>> typical consumer of the dataset.
>>>>>> By that measure, serious numbers of the web sites we all use on a daily
>>>>>> basis are problematic.
>>>>> <snip>
>>>>> 
>>>>> I think the general interpretation of the robots in "robots.txt" is any
>>>>> software agent accessing the site "automatically" (versus a user manually
>>>>> entering a URL).
>>>> I had never thought this.
>>>> My understanding of the agents that should respect the robots.txt is what
>>>> are usually called crawlers or spiders.
>>>> Primarily search engines, but also including things that aim to
>>>> automatically get a whole chunk of a site.
>>>> Of course, there is no de jure standard, but the places I look seem to
>>>> lean to my view.
>>>> http://www.robotstxt.org/orig.html
>>>> "WWW Robots (also called wanderers or spiders) are programs that traverse
>>>> many pages in the World Wide Web by recursively retrieving linked pages.”
>>>> https://en.wikipedia.org/wiki/Web_robot
>>>> "Typically, bots perform tasks that are both simple and structurally
>>>> repetitive, at a much higher rate than would be possible for a human
>>>> alone. “
>>>> It’s all about scale and query rate.
>>>> So a php script that fetches one URI now and then is not the target for
>>>> the restriction - nor indeed is my shell script that daily fetches a
>>>> common page I want to save on my laptop.
>>>> 
>>>> So, I confess, when my system trips over a dbpedia (or any other) URI and
>>>> does follow-your-nose to get the RDF, it doesn’t check that the site
>>>> robots.txt allows it.
>>>> And I certainly don’t expect Linked Data consumers doing simple URI
>>>> resolution to check my robots.txt.
>>>> 
>>>> But you are right, if I am wrong - robots.txt would make no sense in the
>>>> Linked Data world, since pretty much by definition it will always be an
>>>> agent doing the access.
>>>> But then I think we really need a convention (User-agent: ?) that lets me
>>>> tell search engines to stay away, while allowing LD apps to access the
>>>> stuff they want.
>>> 
>>> Then it seems our core disagreement is on the notion of a robot, which is
>>> indeed a grey area. With respect to robots only referring to
>>> warehouses/search engines, this was indeed the primary use-case for
>>> robots.txt, but for me it's just an instance of what robots.txt is used
>>> for.
>>> 
>>> Rather than focus on what is a robot, I think it's important to look at
>>> (some of the commonly quoted reasons) why people use robots.txt and what
>>> the robots.txt requests:
>>> 
>>> 
>>> "Charles Stross claims to have provoked Koster to suggest robots.txt,
>>> after he wrote a badly-behaved web spider that caused an inadvertent
>>> denial of service attack on Koster's server." [1]
>>> 
>>> Note that robots.txt has an optional Crawl-delay primitive. Other reasons:
>>> 
>>> "A robots.txt file on a website will function as a request that specified
>>> robots ignore specified files or directories when crawling a site. This
>>> might be, for example, out of a preference for privacy from search engine
>>> results, or the belief that the content of the selected directories might
>>> be misleading or irrelevant to the categorization of the site as a whole,
>>> or out of a desire that an application only operate on certain data." [1]
>>> 
>>> 
>>> So moving aside from the definition of a robot, more importantly, I think
>>> a domain administrator has very good grounds to be annoyed if any software
>>> agent (not just Google et al.) breaks the conditions requested in the
>>> robots.txt: conditions specifying local/sensitive data or request delays.
>>> 
>>> Likewise, I think that a lot of Linked Data clients are liable to break
>>> such conditions, especially those that follow links on the fly. Such LD
>>> clients can cause, for example, DoS attacks by requesting lots of pages in
>>> parallel in a very short time, or externalisation of site content that was
>>> intended to be kept local.
>>> 
>>> Aside from the crawlers of Linked Data warehouses, LD clients that could
>>> cause (D)DoS attacks include on-the-fly link traversal query execution
>>> engines, or navigational systems, or browsers that gather additional
>>> background information on-the-fly (like labels) from various surrounding
>>> documents.
>>> 
>>> Other Linked Data clients may surface (blacklisted) data intended to be
>>> kept local on other sites. This would include LD clients that present
>>> blacklisted data from the site in question through an interface on another
>>> site, or that integrate said data with data from another site.
>>> 
>>> For me, this captures pretty much all "non-trivial" Linked Data clients.
>>> 
>>> 
>>> My reasoning is then as follows:
>>> 
>>> *) robots.txt should be respected by all software agents, not just
>>> warehouse crawlers, and not just Google et al;
>>> *) by their nature, robots.txt restrictions are relevant to most Linked
>>> Data clients and a (responsible) Linked Data client should not breach the
>>> stated conditions;
>>> *) a robots.txt blacklist will completely stop (responsible) Linked Data
>>> clients from being able to dereference URIs;
>>> *) a Linked Dataset behind a robots.txt blacklist is not a Linked Dataset.
>>> 
>>> (There's maybe some shades of grey in there that I have not properly
>>> represented, but I think the arguments hold for the most part while
>>> avoiding the ambiguous question of "what is a robot?".)
>>> 
>>> Best,
>>> Aidan
>>> 
>>> [1] http://en.wikipedia.org/wiki/Robots_exclusion_standard
>>> 
>>> 
>> 
>> --
>> Hugh Glaser
>>   20 Portchester Rise
>>   Eastleigh
>>   SO50 4QS
>> Mobile: +44 75 9533 4155, Home: +44 23 8061 5652
>> 
>> 
>> 
> 

-- 
Hugh Glaser
   20 Portchester Rise
   Eastleigh
   SO50 4QS
Mobile: +44 75 9533 4155, Home: +44 23 8061 5652

Received on Friday, 25 July 2014 23:36:14 UTC