Re: Billion Triples Challenge Crawl 2014 from Michel Dumontier on 2014-02-14 (semantic-web@w3.org from February 2014)

From: Michel Dumontier <michel.dumontier@gmail.com>
Date: Thu, 13 Feb 2014 23:46:09 -0800
To: Andreas Harth <andreas@harth.org>
Cc: SWIG Web <semantic-web@w3.org>
Message-ID: <CALcEXf5rTNo9uNTb8H-9YSdcSfv6TdFDv_jZDJYOBPCmA2DsbA@mail.gmail.com>

Andreas,

I'd really like to help by getting Bio2RDF data into the crawl, but we
gzip all of our files, and they are in N-Quads format.

http://download.bio2rdf.org/release/3/

Do you think you can add gzip/bzip2 support?
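
For illustration only, here is a rough sketch of how gzip/bzip2 dumps could
be read as a stream. It is written in Python rather than in LDSpider's own
code base, and the helper names (open_dump, iter_statements) are made up for
this example. It assumes the compression can be guessed from the file
extension and relies on N-Quads being line-based, one statement per line.

```python
import bz2
import gzip
from typing import Iterator


def open_dump(path: str):
    """Open a dump as text, choosing a decompressor from the extension.

    Hypothetical helper: the extension check is an assumption, not
    something LDSpider is known to do.
    """
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def iter_statements(path: str) -> Iterator[str]:
    """Yield one N-Quads statement per line, skipping blanks and comments."""
    with open_dump(path) as handle:
        for line in handle:
            line = line.strip()
            if line and not line.startswith("#"):
                yield line
```

Since the file is decompressed as it is read, a large dump never has to be
unpacked to disk first.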

m.

Michel Dumontier
Associate Professor of Medicine (Biomedical Informatics), Stanford University
Chair, W3C Semantic Web for Health Care and the Life Sciences Interest Group
http://dumontierlab.com


On Tue, Feb 11, 2014 at 4:36 AM, Andreas Harth <andreas@harth.org> wrote:

> Hello,
>
> we are about to start crawling for the Billion Triples Challenge
> 2014 dataset in the next couple of weeks.
>
> A few things you can do to prepare:
>
> 1. If you do not want our crawler to visit your site, provide
> a robots.txt [1].  For example:
>
> User-agent: ldspider
> Disallow: /
>
> 2. If you want our crawler to find your site, make sure you
> link your data from well-connected files (e.g. FOAF files).
> You can also send me an entry URI to your dataset which I will
> include in my FOAF file.
>
> 3. Even with an external link (more are better!), your dataset
> might not be well interlinked internally.  Make sure you have
> enough internal links so that the crawler can traverse your dataset.
>
> 4. We plan to wait two seconds between lookups per pay-level
> domain.  If your server cannot sustain that pace, please see #1.
>
> We are aware that LDSpider [2], our crawler, has a number of
> limitations:
>
> * The crawler supports only basic robots.txt features
> (i.e., exclusion)
> * The crawler does not understand sitemaps
> * The crawler does not handle zipped or gzipped data dump files
>
> The Billion Triple Challenge datasets are a valuable resource and
> have been used in benchmarking, data analysis and as a basis for
> applications.  Your data is appreciated.
>
> Many thanks in advance for your support!
>
> Cheers,
> Andreas Harth and Tobias Kaefer.
>
> [1] http://www.robotstxt.org/robotstxt.html
> [2] http://code.google.com/p/ldspider/
>
>
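
The robots.txt rule quoted in item 1 can be sanity-checked before the crawl
starts. A minimal sketch, assuming Python's standard urllib.robotparser; the
example.org URL and the "otherbot" agent are placeholders, and whether
LDSpider's own robots.txt handling matches user agents in exactly the same
way is not confirmed here.

```python
from urllib.robotparser import RobotFileParser

# The exclusion rule from the announcement, verbatim.
ROBOTS_TXT = """\
User-agent: ldspider
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Placeholder URL; substitute a document from your own site.
    print(is_allowed(ROBOTS_TXT, "ldspider", "http://example.org/data.rdf"))
    print(is_allowed(ROBOTS_TXT, "otherbot", "http://example.org/data.rdf"))
```

With this rule in place the first check should come back disallowed, while
agents not named in the file remain unaffected.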

Received on Friday, 14 February 2014 07:46:57 UTC