- From: Hugh Glaser <hugh@glasers.org>
- Date: Fri, 25 Jul 2014 19:44:50 +0100
- To: ahogan@dcc.uchile.cl
- Cc: public-lod@w3.org
Hi Aidan,

I think I probably agree with everything you say, but with one exception:

On 25 Jul 2014, at 19:14, ahogan@dcc.uchile.cl wrote:

> found that the crawl encountered many problems accessing the various
> datasets in the catalogue: robots.txt, 401s, 502s, bad conneg, 404/dead,
> etc.

The idea that having a robots.txt that Disallows spiders is a “problem” for a dataset is rather bizarre. It is of course a problem for the spider, but it is clearly not a problem for a typical consumer of the dataset. By that measure, serious numbers of the web sites we all use on a daily basis are problematic.

By the way, the reason this has come up for me is that I was quite happy not to be spidered for the BTC (a conscious decision), but I think that some of my datasets might be useful for people, so I would prefer to see them included in the LOD Cloud. I actually didn’t submit a seed list to the BTC; but I had forgotten that we had robots.txt everywhere, so it wouldn’t have helped in any case! :-)

Anyway, we just need to get around the problem, if we feel that this is all useful.

So… let’s do something about it. I’m no robots.txt expert, but I have changed the appropriate robots.txt to have:

User-agent: LDSpider
Allow: *

User-agent: *
Sitemap: http://{}.rkbexplorer.com/sitemap.xml
Disallow: /browse/
...

I wonder whether this (or something similar) is useful? I realise that it is now too late for the current activity (I assume), but I’ll just leave it all there for future stuff.

Cheers

-- 
Hugh Glaser
20 Portchester Rise
Eastleigh
SO50 4QS
Mobile: +44 75 9533 4155, Home: +44 23 8061 5652
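As a rough sanity check on a record like the one above, here is a minimal Python sketch (standard library only, not part of the original mail) that parses an equivalent robots.txt and asks which agents may fetch what. The hostname example.rkbexplorer.com and the non-LDSpider agent name are purely illustrative, and “Allow: /” stands in for “Allow: *”, since urllib.robotparser matches rule paths by prefix rather than as wildcards.

from urllib import robotparser

# Illustrative robots.txt modelled on the record above.
ROBOTS_TXT = """\
User-agent: LDSpider
Allow: /

User-agent: *
Sitemap: http://example.rkbexplorer.com/sitemap.xml
Disallow: /browse/
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# LDSpider has its own record, so it may fetch anything:
print(rp.can_fetch("LDSpider", "http://example.rkbexplorer.com/browse/x"))     # True
# Other crawlers fall under the "*" record and are kept out of /browse/:
print(rp.can_fetch("SomeOtherBot", "http://example.rkbexplorer.com/browse/x")) # False
print(rp.can_fetch("SomeOtherBot", "http://example.rkbexplorer.com/id/x"))     # True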
Received on Friday, 25 July 2014 18:46:13 UTC