- From: Aaron Bradley <aaranged@gmail.com>
- Date: Tue, 11 Dec 2012 10:07:45 -0800
- To: Christian Bizer <chris@bizer.de>
- Cc: public-lod@w3.org, semantic-web@w3.org, Public Vocabs <public-vocabs@w3.org>
- Message-ID: <CAMbipBuS0cM04ydvwF+v8c0tGJKq=zpXBFU838S+N7_er0UPwg@mail.gmail.com>
Thanks for this, Christian and Robert.

I've been comparing the 2009/2010 corpus against this, and have noticed the comparative growth of RDFa and microdata (proportionally). Is this a valid apples-to-apples comparison? That is, do the two data sets (2009/2010 and Aug. 2012) reference the same domain set, or might a comparison of the two data sets be ill-advised because of the variance in items selected in each case?

On Tue, Dec 11, 2012 at 6:48 AM, Christian Bizer <chris@bizer.de> wrote:

> Hi all,
>
> more and more websites embed structured data describing for instance
> products, people, organizations, places, events, resumes, and cooking
> recipes into their HTML pages using markup formats such as RDFa, Microdata
> and Microformats.
>
> The Web Data Commons project extracts all Microformat, Microdata and RDFa
> data from the Common Crawl web corpus, the largest and most up-to-date web
> corpus that is currently available to the public, and provides the
> extracted data for download. In addition, we calculate and publish
> statistics about the deployment of the different formats as well as the
> vocabularies that are used together with each format.
>
> Today, we are happy to announce the release of a new WebDataCommons
> dataset.
>
> The dataset has been extracted from the latest version of the Common
> Crawl. This August 2012 version of the Common Crawl contains over
> 3 billion HTML pages which originate from over 40 million websites
> (pay-level-domains).
>
> Altogether we discovered structured data within 369 million HTML pages
> contained in the Common Crawl corpus (12.3%). The pages containing
> structured data originate from 2.29 million websites (5.65%).
> Approximately 519 thousand of these websites use RDFa, while 140 thousand
> websites use Microdata. Microformats are used on 1.7 million websites.
>
> Basic statistics about the extracted dataset as well as the vocabularies
> that are used together with each encoding format are found at:
>
> http://www.webdatacommons.org/2012-08/stats/stats.html
>
> Additional statistics that analyze top-level domain distribution and the
> popularity of the websites covered by the Common Crawl, as well as the
> topical domains of the embedded data are found at:
>
> http://www.webdatacommons.org/2012-08/stats/additional_stats.html
>
> The overall size of the August 2012 WebDataCommons dataset is 7.3 billion
> quads. The dataset is split into 1,416 files each having a size of around
> 100 MB. In order to make it easier to find data from a specific website or
> top-level-domain, we provide indexes about the location of specific data
> within the files.
>
> In order to make it easy for third parties to investigate the usage of
> different vocabularies and to generate seed-lists for focused crawling
> endeavors, we provide a website-class-property matrix for each format. The
> matrixes indicate which vocabulary term (class/property) is used by which
> website and avoid that you need to download and scan the whole dataset to
> obtain this information.
>
> The extracted dataset and website-class-property matrix can be downloaded
> from:
>
> http://www.webdatacommons.org/2012-08/stats/how_to_get_the_data.html
>
> Lots of thanks to:
>
> + the Common Crawl project for providing their great web crawl and thus
> enabling the Web Data Commons project.
> + the Any23 project for providing their great library of structured data
> parsers.
> + the PlanetData and the LOD2 EU research projects for supporting
> WebDataCommons.
>
> Have fun with the new dataset.
>
> Cheers,
>
> Christian Bizer and Robert Meusel
>
>
> --
> Prof. Dr. Christian Bizer
> Chair of Information Systems V
> Web-based Systems Group
> Universität Mannheim
> B6, 26, Room B1.15
> D-68131 Mannheim
> Tel.: +49(0)621/181-2677
> Fax.: +49(0)621/181-2682
> Mail: chris@informatik.uni-mannheim.de
> Web: www.bizer.de
>

--
*Resistance is futile: I have been assimilated.* Into the Google world that is. Accordingly, *my default email address is now aaranged@gmail.com*. If I'm in your address book as aaranged@yahoo.com you may want to update that detail (but Yahoo! mail will continue to be forwarded to this account).
Received on Tuesday, 11 December 2012 18:08:23 UTC