- From: Chris Bizer <chris@bizer.de>
- Date: Fri, 17 Jun 2011 13:51:47 +0200
- To: "'Giovanni Tummarello'" <giovanni.tummarello@deri.org>
- Cc: "'Semantic Web'" <semantic-web@w3.org>, "'public-lod'" <public-lod@w3.org>, <semanticweb@yahoogroups.com>
- Message-ID: <00bf01cc2ce4$ef950c30$cebf2490$@bizer.de>
Hi Giovanni,

Yes, it's great that you and your team have provided the Sindice crawl as a dump for TREC 2011. Now the community has two large-scale datasets for experimentation: your dataset covers various types of structured data on the Web (RDFa, Web APIs, microformats …), while the new Billion Triple Challenge dataset focuses on data that is published according to the Linked Data principles.

Our dataset is relatively current (May/June 2011), and we also still provide the 2010 and 2009 versions for download so that people can analyze the evolution of the Web of Linked Data. Your dataset covers the whole time span (2009-2011). Does it contain any meta-information about how old specific parts of the dataset are, so that people can also analyze the evolution?

Let's hope that Google, Yahoo, or Microsoft will soon start providing an API over the Schema.org data that they extract from webpages (or even provide this data as a dump). Then the community would have three real-world datasets as a basis for future research :-)

Cheers,
Chris

From: g.tummarello@gmail.com [mailto:g.tummarello@gmail.com] On behalf of Giovanni Tummarello
Sent: Friday, 17 June 2011 13:35
To: Chris Bizer
Cc: Semantic Web; public-lod; semanticweb@yahoogroups.com
Subject: Re: Semantic Web Challenge 2011 CfP and Billion Triple Challenge 2011 Data Set published.

This year, the Billion Triple Challenge data set consists of 2 billion triples. The dataset was crawled during May/June 2011 using a random sample of URIs from the BTC 2010 dataset as seed URIs. Many thanks to Andreas Harth for all the effort he put into crawling the web to compile this dataset, and to the Karlsruher Institut für Technologie, which provided the necessary hardware for this labour-intensive task.

On a related note: while nothing can beat a custom job, obviously, I feel like reminding those who don't have that kind of time, money, or resources that any amount of data one wants is freely available from the Sindice repositories for things like this (0 to 20++ billion triples, LOD or non-LOD, microformats, RDFa, custom filtered, etc.). See the TREC 2011 competition http://data.sindice.com/trec2011/download.html (1 TB+ of data!) or the recent W3C data analysis which is leading to a new recommendation (http://www.w3.org/2010/02/rdfa/profile/data/), etc. Just trying to help.

Congrats, of course, on the great job with the Semantic Web Challenge, which is a long-standing great initiative!

Gio
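As an illustration of the kind of seed-sampling step mentioned in the announcement above, here is a minimal Python sketch that draws a uniform random sample of subject URIs from a dump file. It assumes the dump is distributed as gzipped N-Quads with the subject URI at the start of each line; the file name and sample size are hypothetical placeholders, not details of the actual BTC 2010/2011 setup.

    import gzip
    import random
    import re

    # Minimal sketch: draw a random sample of subject URIs from a gzipped
    # N-Quads dump to use as crawl seeds. File name and sample size are
    # placeholders, not details of the actual BTC crawl.
    DUMP_FILE = "btc-2010-chunk-000.nq.gz"   # hypothetical dump chunk
    SAMPLE_SIZE = 10_000

    uri_pattern = re.compile(r"^<([^>]+)>")  # subject URI at line start

    seeds = []
    seen = 0
    with gzip.open(DUMP_FILE, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            match = uri_pattern.match(line)
            if not match:
                continue
            seen += 1
            uri = match.group(1)
            # Reservoir sampling: keep a uniform random sample in one pass.
            if len(seeds) < SAMPLE_SIZE:
                seeds.append(uri)
            else:
                j = random.randrange(seen)
                if j < SAMPLE_SIZE:
                    seeds[j] = uri

    print(f"kept {len(seeds)} seed URIs out of {seen} subjects")

Reservoir sampling keeps memory use bounded to the sample size, which matters when the dump holds billions of statements.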
Received on Friday, 17 June 2011 11:50:18 UTC