Re: ANN: releases 7.3 billion quads RDFa, Microdata and Microformat data originating from 2.29 million pay-level-domains

Thanks for this Christian and Robert.

I've been comparing the 2009/2010 corpus against this one, and have noticed
that the share of RDFa and Microdata has grown between the two.

Is this a valid apples-to-apples comparison?  That is, do the two data sets
(2009/2010 and Aug. 2012) reference the same domain set, or might a
comparison of the two data sets be ill-advised because of the variance in
items selected in each case?
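
For what it's worth, here is a rough sketch of the kind of proportional comparison I mean. The August 2012 counts are the ones quoted in the announcement below; the function names and the idea of passing in a second corpus are my own assumptions, and I have deliberately left the 2009/2010 figures out rather than quote them from memory. Note that a website can use more than one format, so the shares need not sum to 100%.

```python
def format_shares(sites_by_format, sites_with_data):
    """Share of websites with structured data that use each format."""
    return {fmt: n / sites_with_data for fmt, n in sites_by_format.items()}


def share_change(old_shares, new_shares):
    """Percentage-point change for formats present in both corpora.

    Only meaningful if both corpora sample websites in a comparable way,
    which is exactly the open question here.
    """
    common = old_shares.keys() & new_shares.keys()
    return {fmt: (new_shares[fmt] - old_shares[fmt]) * 100 for fmt in common}


# Website counts from the August 2012 announcement (approximate).
aug_2012 = {"RDFa": 519_000, "Microdata": 140_000, "Microformats": 1_700_000}
shares_2012 = format_shares(aug_2012, sites_with_data=2_290_000)

for fmt, share in shares_2012.items():
    print(f"{fmt}: {share:.1%} of websites with structured data")
```

Feeding the equivalent 2009/2010 counts through format_shares and then calling share_change would give the growth figures, but only if the two crawls cover comparable sets of domains.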

On Tue, Dec 11, 2012 at 6:48 AM, Christian Bizer wrote:

> Hi all,
> more and more websites embed structured data describing for instance
> products, people, organizations, places, events, resumes, and cooking
> recipes into their HTML pages using markup formats such as RDFa, Microdata
> and Microformats.
> The Web Data Commons project extracts all Microformat, Microdata and RDFa
> data from the Common Crawl web corpus, the largest and most up-to-date web
> corpus that is currently available to the public, and provides the
> extracted data for download. In addition, we calculate and publish
> statistics about the deployment of the different formats as well as the
> vocabularies that are used together with each format.
> Today, we are happy to announce the release of a new WebDataCommons
> dataset.
> The dataset has been extracted from the latest version of the Common
> Crawl. This August 2012 version of the
> Common Crawl contains over 3 billion HTML pages which originate from over
> 40 million websites (pay-level-domains).
> Altogether we discovered structured data within 369 million HTML pages
> contained in the Common Crawl corpus (12.3%). The pages containing
> structured data originate from 2.29 million websites (5.65%).
> Approximately 519 thousand of these websites use RDFa, while 140 thousand
> websites use Microdata. Microformats are used on 1.7 million websites.
> Basic statistics about the extracted dataset as well as the vocabularies
> that are used together with each encoding format are found at:
> Additional statistics that analyze top-level domain distribution and the
> popularity of the websites covered by the Common Crawl, as well as the
> topical domains of the embedded data are found at:
> The overall size of the August 2012 WebDataCommons dataset is 7.3 billion
> quads. The dataset is split into 1,416 files each having a size of around
> 100 MB. In order to make it easier to find data from a specific website or
> top-level-domain, we provide indexes about the location of specific data
> within the files.
> In order to make it easy for third parties to investigate the usage of
> different vocabularies and to generate seed-lists for focused crawling
> endeavors, we provide a website-class-property matrix for each format. The
> matrices indicate which vocabulary term (class/property) is used by which
> website, so that you do not need to download and scan the whole dataset to
> obtain this information.
> The extracted dataset and website-class-property matrix can be downloaded
> from:
> Lots of thanks to:
> + the Common Crawl project for providing their great web crawl and thus
> enabling the Web Data Commons project.
> + the Any23 project for providing their great library of structured data
> parsers.
> + the PlanetData and the LOD2 EU research projects for supporting
> WebDataCommons.
> Have fun with the new dataset.
> Cheers,
> Christian Bizer and Robert Meusel
> --
> Prof. Dr. Christian Bizer
> Chair of Information Systems V
> Web-based Systems Group
> Universität Mannheim
> B6, 26, Room B1.15
> D-68131 Mannheim
> Tel.: +49(0)621/181-2677
> Fax.: +49(0)621/181-2682
> Mail: chris@informatik.uni-mannheim.de
> Web:

Received on Tuesday, 11 December 2012 18:08:23 UTC