Re: Billion Triples Challenge Crawl 2014

Hi again,

I am sending this to the list because there may be issues that people would like to discuss.

There are some interesting datasets you might not otherwise find at http://data.totl.net
For example:
http://data.totl.net/chess/state/rnbqkbnr_pppppppp_8_8_8_8_PPPPPPPP_RNBQKBNR_w_KQkq_-_0_1
This alone has far more than a Billion Triples. :-) (Shannon estimated around 10^43 positions.)
Lest anyone think that such a dataset is not useful, I point out that it can be used as the basis for a game player:
http://graphite.ecs.soton.ac.uk/examples/playgame.php?state=http://data.totl.net/chess/state/rnbqkbnr_pppppppp_8_8_8_8_PPPPPPPP_RNBQKBNR_w_KQkq_-_0_1
And, perhaps more interestingly, it would be a fascinating basis for a KB of games represented in a triplestore, which would show the linkage from every position to every other game in the KB in which that position was played (no searching required!).
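To make that concrete, a KB like that could be queried along these lines - a rough sketch only, where the ex:hasPosition property and the games.ttl dump are made up for illustration, and only the state URI is real:

# Sketch: a hypothetical KB where each game resource is linked via
# ex:hasPosition to the data.totl.net state URIs it passed through.
from rdflib import Graph

g = Graph()
g.parse("games.ttl", format="turtle")  # hypothetical dump of the games KB

QUERY = """
PREFIX ex: <http://example.org/chess#>
SELECT ?otherGame WHERE {
  ?otherGame ex:hasPosition
    <http://data.totl.net/chess/state/rnbqkbnr_pppppppp_8_8_8_8_PPPPPPPP_RNBQKBNR_w_KQkq_-_0_1> .
}
"""

# Because positions are shared URIs, the linkage between games is just
# the set of triples pointing at the same state resource - no search needed.
for row in g.query(QUERY):
    print(row.otherGame)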

So one question is, how do you feel about such stuff in the Crawl?
And another is, what should the Crawl do with such effectively unbounded datasets?
And indeed, what about genuinely unbounded ones, such as http://km.aifb.kit.edu/projects/numbers/ (built as an April Fool, but actually useful), or some of the other datasets we now have that are linked, unbounded RDF?
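One possible answer, sketched here purely for illustration (this is not how LDSpider works, and the cap, the seed URI and the delay are arbitrary choices), is to take a bounded sample of each site rather than trying to exhaust it:

# Illustration only: sample an effectively unbounded linked dataset by
# capping how many URIs are dereferenced per site.
from collections import defaultdict, deque
from urllib.parse import urlparse
import time

from rdflib import Graph, URIRef

MAX_PER_HOST = 100   # crude bound so an unbounded dataset yields a finite sample
DELAY = 2            # seconds between lookups, per the politeness in the announcement below

seeds = ["http://km.aifb.kit.edu/projects/numbers/"]   # illustrative seed only
fetched = defaultdict(int)                             # host -> URIs dereferenced
seen, frontier = set(seeds), deque(seeds)

while frontier:
    uri = frontier.popleft()
    host = urlparse(uri).hostname or ""   # hostname as a stand-in for pay-level domain
    if fetched[host] >= MAX_PER_HOST:
        continue                          # stop expanding this (possibly infinite) site
    fetched[host] += 1
    time.sleep(DELAY)
    g = Graph()
    try:
        g.parse(uri)                      # fetch and parse RDF, where available
    except Exception:
        continue
    for obj in g.objects():               # enqueue every URI mentioned as an object
        if isinstance(obj, URIRef) and str(obj) not in seen:
            seen.add(str(obj))
            frontier.append(str(obj))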

Personally I would like to see representation of these datasets in the Crawl.

Another question arises from the datasets over on the right of that page (Perverse Datasets).
Some, such as Chris, who built data.totl.net, would argue that the Semantic Web hasn't really arrived until it is used for spam and other, perhaps "destructive", purposes.
Things like http://data.totl.net/dave.rdf and http://data.totl.net/same_name.rdf might be of interest in this respect.

Again, I personally think that such datasets may well have a place in the Crawl - perhaps including them would encourage research into identifying such material before it becomes more widespread?

Best
Hugh

On 11 Feb 2014, at 12:36, Andreas Harth <andreas@harth.org> wrote:

> Hello,
> 
> we are about to start crawling for the Billion Triples Challenge
> 2014 dataset in the next couple of weeks.
> 
> A few things you can do to prepare:
> 
> 1. If you do not want our crawler to visit your site, provide
> a robots.txt [1].  For example:
> 
> User-agent: ldspider
> Disallow: /
> 
> 2. If you want our crawler to find your site, make sure you
> link your data from well-connected files (e.g. FOAF files).
> You can also send me an entry URI to your dataset which I will
> include in my FOAF file.
> 
> 3. Even with an external link (more are better!), your dataset
> might not be well interlinked internally.  Make sure you have
> enough internal links so that the crawler can traverse your dataset.
> 
> 4. We plan to wait two seconds between lookups per pay-level
> domain.  If your server cannot sustain that pace, please see #1.
> 
> We are aware that LDSpider [2], our crawler, has a number of
> limitations:
> 
> * The crawler supports only basic robots.txt features
> (i.e., exclusion)
> * The crawler does not understand sitemaps
> * The crawler does not handle zipped or gzipped data dump files
> 
> The Billion Triple Challenge datasets are a valuable resource and
> have been used in benchmarking, data analysis and as a basis for
> applications.  Your data is appreciated.
> 
> Many thanks in advance for your support!
> 
> Cheers,
> Andreas Harth and Tobias Kaefer.
> 
> [1] http://www.robotstxt.org/robotstxt.html
> [2] http://code.google.com/p/ldspider/
> 

-- 
Hugh Glaser
   20 Portchester Rise
   Eastleigh
   SO50 4QS
Mobile: +44 75 9533 4155, Home: +44 23 8061 5652

Received on Sunday, 16 February 2014 12:01:48 UTC