- From: Andreas Harth <andreas@harth.org>
- Date: Mon, 17 Feb 2014 12:59:54 +0100
- To: Hugh Glaser <hugh@glasers.org>
- CC: Semantic Web <semantic-web@w3.org>
Hi,

On 02/16/2014 01:01 PM, Hugh Glaser wrote:
> This is to the list because there may be issues that people would
> like to discuss.

+1

> So one question is, how do you feel about such stuff in the Crawl?
> And another is, what should the Crawl do with such effectively
> unbounded datasets? And indeed, unbounded ones such as
> http://km.aifb.kit.edu/projects/numbers/ (built as an April Fool, but
> that is actually useful), or some other datasets we now have that are
> linked, unbounded, rdf?
>
> Personally I would like to see representation of these datasets in
> the Crawl.

These datasets will be represented, to a degree, as we cannot get the
"entire" web of Linked Data (see Linked Open Numbers).

We plan to crawl for about a month and see what we can get. If we assume
a crawling delay of 2 seconds, we'll dereference at most
(60*60*24)/2 = 43,200 URIs per day per pay-level domain, which leads to
around 1.3 million URIs per pay-level domain for the entire crawl (see
the sketch below). At least that's our plan.

> Again, personally I think that such datasets may well have a place in
> the Crawl - perhaps it would encourage research to identify such
> stuff before it becomes more widespread?

Due to my aversion to manual work, the crawler will just download those
files indiscriminately. I agree with you that we'll need algorithms and
methods to sort out the mess at some point.

Cheers,
Andreas.
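[Editor's sketch: a minimal back-of-the-envelope calculation of the crawl budget described above, assuming a fixed 2-second politeness delay per pay-level domain and a 30-day crawl; the variable names are illustrative, not part of the crawler's code.]

```python
# Upper bound on dereferenced URIs per pay-level domain (PLD),
# assuming one request every CRAWL_DELAY_SECONDS to the same PLD.

CRAWL_DELAY_SECONDS = 2      # politeness delay between requests to a PLD (assumed)
CRAWL_DURATION_DAYS = 30     # "about a month" (assumed)

SECONDS_PER_DAY = 60 * 60 * 24

# Per day: 86400 / 2 = 43,200 URIs per PLD
uris_per_day = SECONDS_PER_DAY // CRAWL_DELAY_SECONDS

# Whole crawl: 43,200 * 30 = 1,296,000 URIs per PLD (~1.3 million)
uris_per_crawl = uris_per_day * CRAWL_DURATION_DAYS

print(f"per day:   {uris_per_day:,} URIs per PLD")
print(f"per crawl: {uris_per_crawl:,} URIs per PLD")
```

These are per-domain ceilings, not totals: the overall crawl size scales with the number of pay-level domains visited in parallel, while the politeness delay caps what any single unbounded dataset (such as Linked Open Numbers) can contribute.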
Received on Monday, 17 February 2014 12:00:21 UTC