Re: Billion Triples Challenge Crawl 2014

Hi,

On 02/16/2014 01:01 PM, Hugh Glaser wrote:
> This is to the list because there may be issues that people would
> like to discuss.

+1

> So one question is, how do you feel about such stuff in the Crawl?
> And another is, what should the Crawl do with such effectively
> unbounded datasets? And indeed, unbounded ones such as
> http://km.aifb.kit.edu/projects/numbers/ (built as an April Fool, but
> that is actually useful), or some other datasets we now have that are
> linked, unbounded, rdf?
>
> Personally I would like to see representation of these datasets in
> the Crawl.

These datasets will be represented, though only to a degree, since we
cannot capture the "entire" web of Linked Data in any case (see Linked
Open Numbers).  We plan to crawl for about a month and see what we can
get.

If we assume a crawling delay of 2 seconds, we'll dereference at most
(60*60*24)/2 = 43,200 URIs per day per pay-level domain, which comes
to roughly 1.3 million URIs per pay-level domain over the month-long
crawl.  At least that's our plan.
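
For concreteness, here is that back-of-the-envelope arithmetic as a
small Python sketch (the 30-day figure just makes "about a month"
explicit):

  CRAWL_DELAY_S = 2                # politeness delay per pay-level domain
  SECONDS_PER_DAY = 60 * 60 * 24

  uris_per_day = SECONDS_PER_DAY // CRAWL_DELAY_S  # 43200
  uris_per_crawl = uris_per_day * 30               # 1296000, ~1.3 million

  print(uris_per_day, uris_per_crawl)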

> Again, personally I think that such datasets may well have a place in
> the Crawl - perhaps it would encourage research to identify such
> stuff before it becomes more widespread?

Due to my aversion to manual work, the crawler will just download those
files indiscriminately.  I agree with you that we'll need algorithms
and methods to sort out the mess at some point.
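
To make that concrete, here is a toy heuristic as a Python sketch
(purely an illustration, not something we have built, and the URI
pattern, threshold, and example domain are made up): flag a pay-level
domain when nearly every URI path it serves ends in a bare number, as
Linked Open Numbers-style generated data tends to.

  import re
  from urllib.parse import urlparse

  NUMERIC_PATH = re.compile(r"/\d+$")  # path ending in a run of digits

  def looks_generated(uris, threshold=0.95):
      # Fraction of URIs whose path ends in a bare numeric ID,
      # e.g. the hypothetical http://example.org/numbers/42
      hits = sum(1 for u in uris if NUMERIC_PATH.search(urlparse(u).path))
      return bool(uris) and hits / len(uris) >= threshold

  # Hypothetical usage: a domain serving only numbered resources
  print(looks_generated(
      ["http://example.org/numbers/%d" % i for i in range(100)]))  # True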

Cheers,
Andreas.

Received on Monday, 17 February 2014 12:00:21 UTC