ANN: WebDataCommons.org releases 7.3 billion quads of RDFa, Microdata and Microformat data originating from 2.29 million pay-level-domains

Hi all,

More and more websites embed structured data describing, for instance, 
products, people, organizations, places, events, resumes, and cooking 
recipes into their HTML pages using markup formats such as RDFa, 
Microdata and Microformats.

The Web Data Commons project extracts all Microformat, Microdata and 
RDFa data from the Common Crawl web corpus, the largest and most 
up-to-date web corpus currently available to the public, and 
provides the extracted data for download. In addition, we calculate and 
publish statistics about the deployment of the different formats as well 
as the vocabularies that are used together with each format.

Today, we are happy to announce the release of a new WebDataCommons dataset.

The dataset has been extracted from the latest version of the Common 
Crawl. This August 2012 version of the
Common Crawl contains over 3 billion HTML pages which originate from 
over 40 million websites (pay-level-domains).

Altogether we discovered structured data within 369 million HTML pages 
contained in the Common Crawl corpus (12.3%). The pages containing 
structured data originate from 2.29 million websites (5.65%).  
Approximately 519 thousand of these websites use RDFa, while 140 
thousand websites use Microdata. Microformats are used on 1.7 million 
websites.

Basic statistics about the extracted dataset, as well as the vocabularies 
that are used together with each encoding format, can be found at:

http://www.webdatacommons.org/2012-08/stats/stats.html

Additional statistics analyzing the top-level-domain distribution and the 
popularity of the websites covered by the Common Crawl, as well as the 
topical domains of the embedded data, can be found at:

http://www.webdatacommons.org/2012-08/stats/additional_stats.html

The overall size of the August 2012 WebDataCommons dataset is 7.3 
billion quads. The dataset is split into 1,416 files, each around 
100 MB in size. To make it easier to find data from a specific 
website or top-level-domain, we provide indexes giving the location 
of that data within the files.
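
To give an idea of how the download files can be processed, below is a 
minimal sketch that streams one extraction file and counts quads per 
crawled page host. It assumes the files are gzipped N-Quads with the URL 
of the source page in the graph position; the file name in the usage 
comment is hypothetical.

    import gzip
    from collections import Counter
    from urllib.parse import urlparse

    def count_quads_per_host(path):
        """Stream a gzipped N-Quads file and count quads per crawled host.

        Assumes each line has the form:
            <subject> <predicate> <object-or-literal> <graph> .
        where the graph IRI is the URL of the page the data was found on.
        """
        hosts = Counter()
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                line = line.rstrip()
                if not line or line.startswith("#"):
                    continue
                # The graph IRI is the last term before the terminating dot.
                graph = line.rstrip(" .").rsplit(" ", 1)[-1].strip("<>")
                hosts[urlparse(graph).netloc] += 1
        return hosts

    # Hypothetical usage (file name is illustrative only):
    # hosts = count_quads_per_host("ccrdf.html-rdfa.0.nq.gz")
    # print(hosts.most_common(10))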

To make it easy for third parties to investigate the usage of 
different vocabularies and to generate seed lists for focused crawling 
endeavors, we provide a website-class-property matrix for each format. 
The matrices indicate which vocabulary term (class/property) is used by 
which website, so that you do not need to download and scan the whole 
dataset to obtain this information.
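
For example, a seed list for focused crawling could be derived from such 
a matrix along the lines of the sketch below. This is only an 
illustration under an assumed file layout: the tab-separated format, the 
file name and the schema.org term in the usage comment are assumptions, 
not the actual serialization (see the download page for that).

    def seed_list(matrix_path, wanted_term):
        """Collect websites whose matrix row contains a given vocabulary term.

        NOTE: this sketch assumes a tab-separated layout with the website
        (pay-level-domain) in the first column and the vocabulary terms
        (classes/properties) it uses in the remaining columns.
        """
        seeds = set()
        with open(matrix_path, encoding="utf-8") as f:
            for line in f:
                website, *terms = line.rstrip("\n").split("\t")
                if wanted_term in terms:
                    seeds.add(website)
        return seeds

    # Hypothetical usage:
    # seeds = seed_list("microdata-matrix.tsv", "http://schema.org/Product")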

The extracted dataset and website-class-property matrix can be 
downloaded from:

http://www.webdatacommons.org/2012-08/stats/how_to_get_the_data.html

Lots of thanks to:

+ the Common Crawl project for providing their great web crawl and thus 
enabling the Web Data Commons project.
+ the Any23 project for providing their great library of structured data 
parsers.
+ the PlanetData and the LOD2 EU research projects for supporting 
WebDataCommons.

Have fun with the new dataset.

Cheers,

Christian Bizer and Robert Meusel


-- 
Prof. Dr. Christian Bizer
Chair of Information Systems V
Web-based Systems Group
Universität Mannheim
B6, 26, Room B1.15
D-68131 Mannheim
Tel.: +49(0)621/181-2677
Fax.: +49(0)621/181-2682
Mail: chris@informatik.uni-mannheim.de
Web: www.bizer.de
