- From: Chris Bizer <chris@bizer.de>
- Date: Fri, 17 Jun 2011 13:51:47 +0200
- To: "'Giovanni Tummarello'" <giovanni.tummarello@deri.org>
- Cc: "'Semantic Web'" <semantic-web@w3.org>, "'public-lod'" <public-lod@w3.org>, <semanticweb@yahoogroups.com>
- Message-ID: <00bf01cc2ce4$ef950c30$cebf2490$@bizer.de>
Hi Giovanni,

yes, it’s great that you and your team have provided the Sindice crawl as a dump for TREC 2011. The community now has two large-scale datasets for experimentation: your dataset, which covers various types of structured data on the Web (RDFa, Web APIs, microformats, ...), and the new Billion Triple Challenge dataset, which focuses on data published according to the Linked Data principles.

Our dataset is relatively current (May/June 2011), and we also still provide the 2010 and 2009 versions for download so that people can analyze the evolution of the Web of Linked Data. Your dataset covers the whole time span (2009-2011). Does it contain any meta-information about how old specific parts of the dataset are, so that people can analyze the evolution there as well?

Let’s hope that Google, Yahoo or Microsoft will soon start providing an API over the Schema.org data that they extract from web pages (or even provide this data as a dump). Then the community would have three real-world datasets as a basis for future research :-)

Cheers,
Chris

From: g.tummarello@gmail.com [mailto:g.tummarello@gmail.com] On behalf of Giovanni Tummarello
Sent: Friday, 17 June 2011 13:35
To: Chris Bizer
Cc: Semantic Web; public-lod; semanticweb@yahoogroups.com
Subject: Re: Semantic Web Challenge 2011 CfP and Billion Triple Challenge 2011 Data Set published.

This year, the Billion Triple Challenge data set consists of 2 billion triples. The dataset was crawled during May/June 2011 using a random sample of URIs from the BTC 2010 dataset as seed URIs. Lots of thanks to Andreas Harth for all his effort put into crawling the web to compile this dataset, and to the Karlsruher Institut für Technologie, which provided the necessary hardware for this labour-intensive task.

On a related note, while nothing can beat a custom job obviously, I feel like reminding those that don't have said mighty time/money/resources that any amount of data one wants is available from the repositories in Sindice, which we make freely available for things like this (0 to 20++ billion triples, LOD or non-LOD, microformats, RDFa, custom filtered, etc.).

See the TREC 2011 competition http://data.sindice.com/trec2011/download.html (1TB+ of data!) or the recent W3C data analysis which is leading to a new recommendation (http://www.w3.org/2010/02/rdfa/profile/data/), etc. Just trying to help.

Congrats of course on the great job, guys, for the Semantic Web Challenge, which is a long-standing great initiative!

Gio
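For anyone who wants to poke at these dumps before committing to a full pipeline: the BTC data is shipped as gzipped N-Quads chunks, where the fourth element of each statement names the document it was crawled from, so a few lines of Python are enough to get a first feel for provenance and coverage. A minimal sketch (the chunk file name below is hypothetical, and the line splitting is deliberately naive):

# Minimal sketch: stream one gzipped N-Quads chunk of a BTC-style dump and
# count quads per context (the source document URI in the fourth position).
# The chunk file name is hypothetical; point it at whatever you downloaded.
import gzip
from collections import Counter

contexts = Counter()
total = 0

with gzip.open("btc-2011-chunk-000.nq.gz", "rt", encoding="utf-8", errors="replace") as f:
    for line in f:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Naive split: assumes each statement ends with "... <context> ."
        parts = line.rsplit(" ", 2)
        if len(parts) != 3:
            continue
        ctx = parts[1]
        if ctx.startswith("<"):
            contexts[ctx] += 1
        total += 1

print(total, "quads from", len(contexts), "distinct source documents")
for ctx, n in contexts.most_common(10):
    print(n, ctx)

Counting quads per source document like this is also a cheap first step when comparing the 2009, 2010 and 2011 snapshots to see how the Web of Linked Data has evolved.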
Received on Friday, 17 June 2011 11:50:11 UTC