- From: Christian Bizer <chris@bizer.de>
- Date: Tue, 11 Dec 2012 15:48:02 +0100
- To: public-lod@w3.org, semantic-web@w3.org, public-vocabs@w3.org
Hi all,

more and more websites embed structured data describing, for instance, products, people, organizations, places, events, resumes, and cooking recipes into their HTML pages using markup formats such as RDFa, Microdata and Microformats.

The Web Data Commons project extracts all Microformat, Microdata and RDFa data from the Common Crawl web corpus, the largest and most up-to-date web corpus that is currently available to the public, and provides the extracted data for download. In addition, we calculate and publish statistics about the deployment of the different formats as well as the vocabularies that are used together with each format.

Today, we are happy to announce the release of a new WebDataCommons dataset.

The dataset has been extracted from the latest version of the Common Crawl. This August 2012 version of the Common Crawl contains over 3 billion HTML pages which originate from over 40 million websites (pay-level-domains).

Altogether we discovered structured data within 369 million HTML pages contained in the Common Crawl corpus (12.3%). The pages containing structured data originate from 2.29 million websites (5.65%). Approximately 519 thousand of these websites use RDFa, while 140 thousand websites use Microdata. Microformats are used on 1.7 million websites.

Basic statistics about the extracted dataset as well as the vocabularies that are used together with each encoding format are found at:

http://www.webdatacommons.org/2012-08/stats/stats.html

Additional statistics that analyze the top-level domain distribution and the popularity of the websites covered by the Common Crawl, as well as the topical domains of the embedded data, are found at:

http://www.webdatacommons.org/2012-08/stats/additional_stats.html

The overall size of the August 2012 WebDataCommons dataset is 7.3 billion quads. The dataset is split into 1,416 files, each having a size of around 100 MB.
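For readers who want to get a quick feel for one of the downloaded files, here is a minimal Python sketch that counts quads per website host. It rests on assumptions that are not stated in this mail: that each file is line-based N-Quads, and that the fourth term of every line is the URL of the page the data was extracted from. The function names and the sample lines are made up for illustration. Note that it groups by plain host name, which is not the same thing as a pay-level-domain.

```python
from collections import Counter
from urllib.parse import urlparse


def graph_host(line):
    """Return the host of the fourth (graph) term of one N-Quads line, or None.

    Assumption: every statement is a quad whose last term is a page URL in
    angle brackets. A plain triple ending in an IRI would be misread.
    """
    line = line.strip()
    if not line or line.startswith("#") or not line.endswith("."):
        return None
    # Drop the closing "." and take the last whitespace-separated token.
    last_term = line[:-1].rstrip().rsplit(None, 1)[-1]
    if not (last_term.startswith("<") and last_term.endswith(">")):
        return None
    return urlparse(last_term[1:-1]).hostname


def count_quads_per_host(lines):
    """Count how many quads each host contributes."""
    counts = Counter()
    for line in lines:
        host = graph_host(line)
        if host is not None:
            counts[host] += 1
    return counts


# Hypothetical sample lines, not taken from the real dataset.
sample = [
    '<http://example.org/p1#item> <http://schema.org/name> "Widget" <http://example.org/p1> .',
    '_:b0 <http://schema.org/price> "9.99" <http://example.org/p2> .',
    '_:b1 <http://schema.org/name> "Other shop" <http://shop.example.com/x> .',
]
counts = count_quads_per_host(sample)
```

To run this over a real file, pass an open file object instead of the sample list; the function reads line by line, so a 100 MB file does not need to fit into memory at once.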
In order to make it easier to find data from a specific website or top-level-domain, we provide indexes that record the location of specific data within the files.

In order to make it easy for third parties to investigate the usage of different vocabularies and to generate seed-lists for focused crawling endeavors, we provide a website-class-property matrix for each format. The matrices indicate which vocabulary term (class/property) is used by which website, so that you do not need to download and scan the whole dataset to obtain this information.

The extracted dataset and the website-class-property matrices can be downloaded from:

http://www.webdatacommons.org/2012-08/stats/how_to_get_the_data.html

Many thanks to:

+ the Common Crawl project for providing their great web crawl and thus enabling the Web Data Commons project.
+ the Any23 project for providing their great library of structured data parsers.
+ the PlanetData and the LOD2 EU research projects for supporting WebDataCommons.

Have fun with the new dataset.

Cheers,

Christian Bizer and Robert Meusel

--
Prof. Dr. Christian Bizer
Chair of Information Systems V
Web-based Systems Group
Universität Mannheim
B6, 26, Room B1.15
D-68131 Mannheim
Tel.: +49(0)621/181-2677
Fax.: +49(0)621/181-2682
Mail: chris@informatik.uni-mannheim.de
Web: www.bizer.de
Received on Tuesday, 11 December 2012 14:47:10 UTC