- From: Stéphane Corlosquet <scorlosquet@gmail.com>
- Date: Tue, 11 Dec 2012 16:56:16 -0500
- To: Christian Bizer <chris@bizer.de>
- Cc: public-lod@w3.org, semantic-web@w3.org, public-vocabs@w3.org
- Message-ID: <CAGR+nnE=tL2aacZabBkhJx=6DuYp2oP4tg3-oVVcGP2XD9Z1Xg@mail.gmail.com>
Thanks for sharing these statistics and dataset.

The Extraction Statistic File Header (tab-separated file) [1] includes a column 'referencedData' which is not explained in the list of column headers at [2]. What does it correspond to?

Steph.

[1] http://www.webdatacommons.org/samples/pages.header.tab
[2] http://www.webdatacommons.org/2012-08/stats/how_to_get_the_data.html#toc3

On Tue, Dec 11, 2012 at 9:48 AM, Christian Bizer <chris@bizer.de> wrote:

> Hi all,
>
> more and more websites embed structured data describing for instance
> products, people, organizations, places, events, resumes, and cooking
> recipes into their HTML pages using markup formats such as RDFa, Microdata
> and Microformats.
>
> The Web Data Commons project extracts all Microformat, Microdata and RDFa
> data from the Common Crawl web corpus, the largest and most up-to-date web
> corpus that is currently available to the public, and provides the
> extracted data for download. In addition, we calculate and publish
> statistics about the deployment of the different formats as well as the
> vocabularies that are used together with each format.
>
> Today, we are happy to announce the release of a new WebDataCommons
> dataset.
>
> The dataset has been extracted from the latest version of the Common
> Crawl. This August 2012 version of the
> Common Crawl contains over 3 billion HTML pages which originate from over
> 40 million websites (pay-level-domains).
>
> Altogether we discovered structured data within 369 million HTML pages
> contained in the Common Crawl corpus (12.3%). The pages containing
> structured data originate from 2.29 million websites (5.65%).
> Approximately 519 thousand of these websites use RDFa, while 140 thousand
> websites use Microdata. Microformats are used on 1.7 million websites.
>
> Basic statistics about the extracted dataset as well as the vocabularies
> that are used together with each encoding format are found at:
>
> http://www.webdatacommons.org/2012-08/stats/stats.html
>
> Additional statistics that analyze top-level domain distribution and the
> popularity of the websites covered by the Common Crawl, as well as the
> topical domains of the embedded data are found at:
>
> http://www.webdatacommons.org/2012-08/stats/additional_stats.html
>
> The overall size of the August 2012 WebDataCommons dataset is 7.3 billion
> quads. The dataset is split into 1,416 files each having a size of around
> 100 MB. In order to make it easier to find data from a specific website or
> top-level-domain, we provide indexes about the location of specific data
> within the files.
>
> In order to make it easy for third parties to investigate the usage of
> different vocabularies and to generate seed-lists for focused crawling
> endeavors, we provide a website-class-property matrix for each format. The
> matrixes indicate which vocabulary term (class/property) is used by which
> website, so you do not need to download and scan the whole dataset to
> obtain this information.
>
> The extracted dataset and website-class-property matrix can be downloaded
> from:
>
> http://www.webdatacommons.org/2012-08/stats/how_to_get_the_data.html
>
> Lots of thanks to:
>
> + the Common Crawl project for providing their great web crawl and thus
> enabling the Web Data Commons project.
> + the Any23 project for providing their great library of structured data
> parsers.
> + the PlanetData and the LOD2 EU research projects for supporting
> WebDataCommons.
>
> Have fun with the new dataset.
>
> Cheers,
>
> Christian Bizer and Robert Meusel
>
>
> --
> Prof. Dr. Christian Bizer
> Chair of Information Systems V
> Web-based Systems Group
> Universität Mannheim
> B6, 26, Room B1.15
> D-68131 Mannheim
> Tel.: +49(0)621/181-2677
> Fax.: +49(0)621/181-2682
> Mail: chris@informatik.uni-mannheim.de
> Web: www.bizer.de
>

--
Steph.
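
For anyone who wants to repeat the check behind the question above (which columns of the tab-separated header file are missing from the documented column list), here is a minimal Python sketch. It is only an illustration: the function name `undocumented_columns` and the placeholder column names are hypothetical, the only real column name is 'referencedData', and the actual contents of the header file [1] and the documented list [2] are not reproduced here.

```python
def undocumented_columns(header_line, documented):
    """Return the header columns, in file order, that are not in `documented`."""
    # The header file is described as tab-separated, so split on tabs.
    columns = [c.strip() for c in header_line.rstrip("\r\n").split("\t")]
    documented_set = set(documented)
    return [c for c in columns if c and c not in documented_set]


# Hypothetical header line: the placeholder names are NOT the real column
# names; only 'referencedData' is taken from the message above.
sample_header = "placeholderA\tplaceholderB\treferencedData\n"

# Hypothetical documented list, standing in for the column list at [2].
sample_documented = ["placeholderA", "placeholderB"]

missing = undocumented_columns(sample_header, sample_documented)
print(missing)
```

To use it on the real files, one would read the first line of the downloaded header file and pass in the column names copied by hand from the documentation page.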
Received on Tuesday, 11 December 2012 21:56:52 UTC