- From: Hugh Glaser <hugh@glasers.org>
- Date: Fri, 25 Jul 2014 19:44:50 +0100
- To: ahogan@dcc.uchile.cl
- Cc: public-lod@w3.org
Hi Aidan,

I think I probably agree with everything you say, but with one exception:

On 25 Jul 2014, at 19:14, ahogan@dcc.uchile.cl wrote:

> found that the crawl encountered many problems accessing the various
> datasets in the catalogue: robots.txt, 401s, 502s, bad conneg, 404/dead,
> etc.

The idea that having a robots.txt that Disallows spiders is a “problem” for a dataset is rather bizarre. It is of course a problem for the spider, but it is clearly not a problem for a typical consumer of the dataset. By that measure, serious numbers of the web sites we all use on a daily basis are problematic.

By the way, the reason this has come up for me is that I was quite happy not to be spidered for the BTC (a conscious decision), but I think that some of my datasets might be useful for people, so I would prefer to see them included in the LOD Cloud. I actually didn’t submit a seed list to the BTC; but I had forgotten that we had robots.txt everywhere, so it wouldn’t have helped in any case! :-)

Anyway, we just need to get around the problem, if we feel that this is all useful.

So… let’s do something about it. I’m no robots.txt expert, but I have changed the appropriate robots.txt to have:

User-agent: LDSpider
Allow: *

User-agent: *
Sitemap: http://{}.rkbexplorer.com/sitemap.xml
Disallow: /browse/
...

I wonder whether this (or something similar) is useful? I realise that it is now too late for the current activity (I assume), but I’ll just leave it all there for future stuff.

Cheers

-- 
Hugh Glaser
20 Portchester Rise
Eastleigh
SO50 4QS
Mobile: +44 75 9533 4155, Home: +44 23 8061 5652
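As a rough sanity check on a record like the one above, here is a minimal Python sketch (standard library only, not part of the original mail) that parses an equivalent robots.txt and asks which agents may fetch what. The hostname example.rkbexplorer.com and the non-LDSpider agent name are purely illustrative, and “Allow: /” stands in for “Allow: *”, since urllib.robotparser matches rule paths by prefix rather than as wildcards.

from urllib import robotparser

# Illustrative robots.txt modelled on the record above.
ROBOTS_TXT = """\
User-agent: LDSpider
Allow: /

User-agent: *
Sitemap: http://example.rkbexplorer.com/sitemap.xml
Disallow: /browse/
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# LDSpider has its own record, so it may fetch anything:
print(rp.can_fetch("LDSpider", "http://example.rkbexplorer.com/browse/x"))     # True
# Other crawlers fall under the "*" record and are kept out of /browse/:
print(rp.can_fetch("SomeOtherBot", "http://example.rkbexplorer.com/browse/x")) # False
print(rp.can_fetch("SomeOtherBot", "http://example.rkbexplorer.com/id/x"))     # True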
Received on Friday, 25 July 2014 18:46:13 UTC