- From: Michel Dumontier <michel.dumontier@gmail.com>
- Date: Thu, 13 Feb 2014 23:46:09 -0800
- To: Andreas Harth <andreas@harth.org>
- Cc: SWIG Web <semantic-web@w3.org>
- Message-ID: <CALcEXf5rTNo9uNTb8H-9YSdcSfv6TdFDv_jZDJYOBPCmA2DsbA@mail.gmail.com>
Andreas,

I'd like to help by getting Bio2RDF data into the crawl, really. But we
gzip all of our files, and they are in N-Quads format:

http://download.bio2rdf.org/release/3/

Think you can add gzip/bzip2 support?

m.

Michel Dumontier
Associate Professor of Medicine (Biomedical Informatics), Stanford University
Chair, W3C Semantic Web for Health Care and the Life Sciences Interest Group
http://dumontierlab.com

On Tue, Feb 11, 2014 at 4:36 AM, Andreas Harth <andreas@harth.org> wrote:
> Hello,
>
> we are about to start crawling for the Billion Triples Challenge
> 2014 dataset in the next couple of weeks.
>
> A few things you can do to prepare:
>
> 1. If you do not want our crawler to visit your site, provide
> a robots.txt [1]. For example:
>
>     User-agent: ldspider
>     Disallow: /
>
> 2. If you want our crawler to find your site, make sure you
> link your data from well-connected files (e.g. FOAF files).
> You can also send me an entry URI to your dataset, which I will
> include in my FOAF file.
>
> 3. Even with an external link (more are better!), your dataset
> might not be well interlinked internally. Make sure you have
> enough internal links so that the crawler can traverse your dataset.
>
> 4. We plan to wait two seconds between lookups per pay-level
> domain. If your server cannot sustain that pace, please see #1.
>
> We are aware that LDSpider [2], our crawler, has a number of
> limitations:
>
> * The crawler supports only basic robots.txt features
>   (i.e., exclusion)
> * The crawler does not understand sitemaps
> * The crawler does not handle zipped or gzipped data dump files
>
> The Billion Triple Challenge datasets are a valuable resource and
> have been used in benchmarking, data analysis, and as a basis for
> applications. Your data is appreciated.
>
> Many thanks in advance for your support!
>
> Cheers,
> Andreas Harth and Tobias Kaefer
>
> [1] http://www.robotstxt.org/robotstxt.html
> [2] http://code.google.com/p/ldspider/
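
To illustrate point 2 of the quoted message: a minimal sketch, in Turtle, of
what an external link from a well-connected FOAF file could look like. The
subject and name are placeholders; the link target is the Bio2RDF entry URI
mentioned above:

    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    <#me> a foaf:Person ;
        foaf:name "Andreas Harth" ;
        rdfs:seeAlso <http://download.bio2rdf.org/release/3/> .

A crawler traversing rdfs:seeAlso links from such a file would then reach the
dataset as an entry point.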
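
On the gzip/bzip2 request: a minimal sketch, in Java (LDSpider's language), of
how a fetcher might wrap a dump-file stream in a decompressor chosen by URL
suffix or Content-Encoding header. This is not LDSpider's actual code; the
class name is illustrative, and bzip2 handling assumes Apache Commons Compress
on the classpath:

    import java.io.BufferedInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import java.net.URLConnection;
    import java.util.zip.GZIPInputStream;

    import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

    public class DumpFetcher {

        // Open a dump URL, transparently decompressing gzip or bzip2 content.
        public static InputStream open(URL url) throws IOException {
            URLConnection conn = url.openConnection();
            InputStream in = new BufferedInputStream(conn.getInputStream());
            String path = url.getPath();
            String encoding = conn.getContentEncoding();
            if (path.endsWith(".gz") || "gzip".equalsIgnoreCase(encoding)) {
                return new GZIPInputStream(in);
            }
            if (path.endsWith(".bz2")) {
                return new BZip2CompressorInputStream(in);
            }
            return in; // plain N-Quads or other uncompressed dump
        }
    }

The returned stream can then be handed to whatever parser consumes the
uncompressed N-Quads.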
Received on Friday, 14 February 2014 07:46:57 UTC