
Re: Billion Triples Challenge Crawl 2014

From: Andreas Harth <andreas@harth.org>
Date: Sat, 15 Feb 2014 16:31:32 +0100
Message-ID: <52FF8854.9020607@harth.org>
To: semantic-web@w3.org

Michel,

On 02/14/2014 08:46 AM, Michel Dumontier wrote:
>   I'd like to help by getting bio2rdf data into the crawl, really. but
> we gzip all of our files, and they are in n-quads format.
>
> http://download.bio2rdf.org/release/3/
>
> think you can add gzip/bzip2 support ?

we have to start the crawl soon, and adding support for data dumps
will take some time.  We can put that feature on our list and see
whether we can include it in the 2015 crawl.

As only 3.81% of Linking Open Data cloud datasets provide a data dump
([1], page 38), we have to decide whether supporting data dumps in the
crawler is a high-priority feature.

Please also note that data dumps do not include 303 redirect
information, something we record when we do lookups during crawling.
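(Not part of the original thread, but to illustrate: recording redirect hops during a lookup can be sketched in a few lines of Python standard library. The server, handler names, and URIs below are all hypothetical stand-ins, with a throwaway local server playing the role of a Linked Data host that 303-redirects a non-information resource to its description.)

```python
import http.server
import threading
import urllib.request

class RedirectRecorder(urllib.request.HTTPRedirectHandler):
    """Record each redirect hop (status code, target URI) seen during a lookup."""
    def __init__(self):
        self.hops = []

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        self.hops.append((code, newurl))
        return super().redirect_request(req, fp, code, msg, headers, newurl)

class ToyLinkedDataHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical host: /resource 303-redirects to its RDF description /data."""
    def do_GET(self):
        if self.path == "/resource":
            self.send_response(303)
            self.send_header("Location", "/data")
            self.end_headers()
        else:
            body = b"<http://example.org/s> <http://example.org/p> <http://example.org/o> .\n"
            self.send_response(200)
            self.send_header("Content-Type", "application/n-triples")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet

server = http.server.HTTPServer(("127.0.0.1", 0), ToyLinkedDataHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

recorder = RedirectRecorder()
opener = urllib.request.build_opener(recorder)
resp = opener.open(f"http://127.0.0.1:{port}/resource")
body = resp.read()
server.shutdown()

print(recorder.hops)  # the 303 hop, which a dump alone would not reveal
```

The point of the sketch: the (303, target) pair is observable only at lookup time, which is exactly the information a data dump cannot carry.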

VoID files provide a means (the void:dataDump property) to link to
data dumps, which would enable a crawler to automatically discover
the data dump URIs.  However, I suspect that will be true for less
than 3.81% of the datasets.
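(Again not from the original message, but a minimal sketch of VoID-based dump discovery, assuming the VoID description is available as N-Triples; the dataset and dump URIs are made up.)

```python
VOID_DATA_DUMP = "<http://rdfs.org/ns/void#dataDump>"

def dump_uris(ntriples: str) -> list[str]:
    """Extract void:dataDump object URIs from an N-Triples VoID description."""
    uris = []
    for line in ntriples.splitlines():
        parts = line.split(None, 2)  # subject, predicate, rest
        if len(parts) == 3 and parts[1] == VOID_DATA_DUMP:
            obj = parts[2].rstrip(" .")
            if obj.startswith("<") and obj.endswith(">"):
                uris.append(obj[1:-1])
    return uris

# Hypothetical VoID description of a dataset
sample = """\
<http://example.org/dataset> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://rdfs.org/ns/void#Dataset> .
<http://example.org/dataset> <http://rdfs.org/ns/void#dataDump> <http://example.org/dumps/all.nq.gz> .
"""
print(dump_uris(sample))  # ['http://example.org/dumps/all.nq.gz']
```

A real crawler would use a proper RDF parser rather than line splitting, but the discovery step is just this: follow void:dataDump objects.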

My goal is to carry out the crawling without manual intervention (i.e.
I point the crawling agent to a set of seed URIs and off it goes).  The
four Linked Data principles are appealing because they provide an
elegant, minimal set of conventions that an automated agent needs to
support.
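(A toy sketch, not the actual crawler: the "point it at seed URIs and off it goes" loop is essentially a breadth-first traversal that dereferences each URI and follows the URIs found in the returned triples. The fetch function is injected so the traversal can be shown without the network; the graph below is invented.)

```python
from collections import deque

def crawl(seeds, fetch, limit=100):
    """Breadth-first Linked Data crawl: dereference each frontier URI via
    fetch(uri) -> iterable of (s, p, o) triples, then enqueue newly seen URIs."""
    frontier = deque(seeds)
    seen = set(seeds)
    triples = []
    while frontier and len(seen) <= limit:
        uri = frontier.popleft()
        for s, p, o in fetch(uri):
            triples.append((s, p, o))
            for term in (s, o):
                if term.startswith("http") and term not in seen:
                    seen.add(term)
                    frontier.append(term)
    return triples

# Hypothetical two-document "Web of Data" standing in for real lookups
graph = {
    "http://a.example/": [("http://a.example/", "links", "http://b.example/")],
    "http://b.example/": [("http://b.example/", "label", "B")],
}
result = crawl(["http://a.example/"], lambda u: graph.get(u, []))
print(result)
```

Nothing here needs per-dataset configuration: as long as every URI dereferences to triples, the same loop works everywhere, which is the appeal of the principles.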

Cheers,
Andreas.

[1] http://wiki.planet-data.eu/uploads/7/79/D2.4.pdf
Received on Saturday, 15 February 2014 15:31:57 UTC
