- From: Andreas Harth <andreas@harth.org>
- Date: Tue, 11 Feb 2014 13:36:15 +0100
- To: semantic-web@w3.org
Hello,

we are about to start crawling for the Billion Triples Challenge 2014 dataset in the next couple of weeks. A few things you can do to prepare:

1. If you do not want our crawler to visit your site, provide a robots.txt [1]. For example:

   User-agent: ldspider
   Disallow: /

2. If you want our crawler to find your site, make sure you link your data from well-connected files (e.g. FOAF files). You can also send me an entry URI for your dataset, which I will include in my FOAF file.

3. Even with an external link (more are better!), your dataset might not be well interlinked internally. Make sure you have enough internal links so that the crawler can traverse your dataset (see the Turtle sketch after the references).

4. We plan to wait two seconds between lookups per pay-level domain. If your server cannot sustain that pace, please see #1.

We are aware that LDSpider [2], our crawler, has a number of limitations:

* The crawler supports only basic robots.txt features (i.e., exclusion)
* The crawler does not understand sitemaps
* The crawler does not handle zipped or gzipped data dump files

The Billion Triple Challenge datasets are a valuable resource and have been used in benchmarking, data analysis, and as a basis for applications. Your data is appreciated.

Many thanks in advance for your support!

Cheers,
Andreas Harth and Tobias Kaefer

[1] http://www.robotstxt.org/robotstxt.html
[2] http://code.google.com/p/ldspider/
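
P.S. To illustrate points 2 and 3, here is a minimal Turtle sketch of the kind of linking we mean; the example.org URIs are placeholders, so substitute your own:

   @prefix foaf: <http://xmlns.com/foaf/0.1/> .
   @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

   # External link: a well-connected FOAF profile points to the
   # dataset's entry URI (placeholder URIs throughout).
   <http://example.org/me> a foaf:Person ;
       rdfs:seeAlso <http://data.example.org/dataset/entry> .

   # Internal links: resources within the dataset reference each other,
   # so the crawler can traverse from one document to the next.
   <http://data.example.org/dataset/entry>
       rdfs:seeAlso <http://data.example.org/dataset/resource1> ,
                    <http://data.example.org/dataset/resource2> .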
Received on Tuesday, 11 February 2014 12:36:42 UTC