- From: Andreas Harth <andreas@harth.org>
- Date: Tue, 11 Feb 2014 13:36:15 +0100
- To: semantic-web@w3.org
Hello,

we are about to start crawling for the Billion Triples Challenge 2014 dataset in the next couple of weeks. A few things you can do to prepare:

1. If you do not want our crawler to visit your site, provide a robots.txt [1]. For example:

   User-agent: ldspider
   Disallow: /

2. If you want our crawler to find your site, make sure you link your data from well-connected files (e.g. FOAF files). You can also send me an entry URI for your dataset, which I will include in my FOAF file.

3. Even with an external link (more are better!), your dataset might not be well interlinked internally. Make sure you have enough internal links so that the crawler can traverse your dataset (see the Turtle sketch after the references).

4. We plan to wait two seconds between lookups per pay-level domain. If your server cannot sustain that pace, please see #1.

We are aware that LDSpider [2], our crawler, has a number of limitations:

* The crawler supports only basic robots.txt features (i.e., exclusion)
* The crawler does not understand sitemaps
* The crawler does not handle zipped or gzipped data dump files

The Billion Triple Challenge datasets are a valuable resource and have been used in benchmarking, data analysis, and as a basis for applications. Your data is appreciated.

Many thanks in advance for your support!

Cheers,
Andreas Harth and Tobias Kaefer

[1] http://www.robotstxt.org/robotstxt.html
[2] http://code.google.com/p/ldspider/
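
P.S. To illustrate points 2 and 3, here is a minimal Turtle sketch of the kind of linking we mean; the example.org URIs are placeholders, so substitute your own:

   @prefix foaf: <http://xmlns.com/foaf/0.1/> .
   @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

   # External link: a well-connected FOAF profile points to the
   # dataset's entry URI (placeholder URIs throughout).
   <http://example.org/me> a foaf:Person ;
       rdfs:seeAlso <http://data.example.org/dataset/entry> .

   # Internal links: resources within the dataset reference each other,
   # so the crawler can traverse from one document to the next.
   <http://data.example.org/dataset/entry>
       rdfs:seeAlso <http://data.example.org/dataset/resource1> ,
                    <http://data.example.org/dataset/resource2> .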
Received on Tuesday, 11 February 2014 12:36:42 UTC