ANN: WebDataCommons releases 24.4 billion quads of RDFa, Microdata, Embedded JSON-LD and Microformat data originating from 2.7 million pay-level domains

Hi All,

we are happy to announce a new release of the Web Data Commons RDFa,
Microdata, Embedded JSON-LD and Microformat data corpus.

The data corpus has been extracted from the November 2015 version of the
Common Crawl covering 1.77 billion HTML pages which originate from 14.4
million websites (pay-level domains).

Altogether we discovered structured data within 541 million HTML pages out
of the 1.77 billion pages contained in the crawl (30%). These pages
originate from 2.7 million different pay-level-domains out of the 14.4
million pay-level domains covered by the crawl (19%).

Approximately 521 thousand of these websites use RDFa, while 1.1 million
websites use Microdata. Microformats are also used by over 1 million
websites within the crawl. For the first time, we have also extracted
embedded JSON-LD <https://developers.google.com/schemas/formats/json-ld>,
which we can report to be used by more than 596 thousand websites.

Background:

More and more websites embed structured data describing for instance
products, people, organizations, places, events, reviews, and cooking
recipes into their HTML pages using markup formats such as RDFa, Microdata
and Microformats.

The WebDataCommons project extracts all Microformat, Microdata and RDFa
data, and since 2015 also embedded JSON-LD data, from the Common Crawl
web corpus, the largest and most up-to-date web corpus that is
available to the public, and provides the extracted data for download.
In addition, we publish statistics about the adoption of the different
markup formats as well as the vocabularies that are used together
with each format.

Besides the data extracted from the named markup syntaxes, the
WebDataCommons project also provides one of the largest publicly
accessible corpora of WebTables extracted from web crawls, as well
as a collection of hypernyms extracted from billions of web pages, for
download.

General information about the WebDataCommons project is found at

http://webdatacommons.org/


Data Set Statistics:

Basic statistics about the November 2015 RDFa, Microdata, Embedded JSON-LD
and Microformat data sets as well as the vocabularies that are used
together with each markup format are found at:

http://webdatacommons.org/structureddata/2015-11/stats/stats.html

Comparing the statistics to the statistics about the December 2014
release of the data sets

http://webdatacommons.org/structureddata/2014-12/stats/stats.html

we see that the adoption of the Microdata markup syntax has again
increased (1.1 million websites in 2015 compared to 819 thousand in
2014, where both crawls cover a comparable number of websites),
whereas the deployment of RDFa and Microformats is more or less stable.

As already observed in the previous year, the vocabulary schema.org,
recommended by Google, Microsoft, Yahoo!, and Yandex, is most
frequently used by webmasters in the context of Microdata.
We observe a decreasing deployment of its predecessor, the
data-vocabulary.org vocabulary.
In the context of RDFa, we still find the Open Graph Protocol
recommended by Facebook to be the most widely used vocabulary.

Topic-wise, the trends identified in the former extractions continue.
We see that, besides navigational, blog and CMS related
meta-information, many websites annotate e-commerce related data
(Products, Offers, and Reviews) as well as contact information
(LocalBusiness, Organization, PostalAddress).

For the first time, we have also extracted information marked up
using embedded JSON-LD. Over 99% of all webmasters using
this syntax use it to mark up search boxes on their
webpages (http://schema.org/SearchAction). Only a small part of the
websites also use embedded JSON-LD to annotate other
information, e.g. about organizations (92 thousand websites)
or persons (18 thousand websites).
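To illustrate what embedded JSON-LD looks like in a page, here is a minimal Python sketch that collects the contents of script elements of type application/ld+json and lists the "@type" values they declare. This is not the extraction pipeline used for the release (which relies on the Any23 parsers); the class JsonLdExtractor, the helper declared_types and the sample page on example.org are hypothetical names made up for this example.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect parsed contents of <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").strip().lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # malformed blocks are skipped in this sketch


def declared_types(node):
    """Recursively collect every "@type" value in a parsed JSON-LD document."""
    found = set()
    if isinstance(node, dict):
        value = node.get("@type")
        if isinstance(value, str):
            found.add(value)
        elif isinstance(value, list):
            found.update(v for v in value if isinstance(v, str))
        for child in node.values():
            found |= declared_types(child)
    elif isinstance(node, list):
        for item in node:
            found |= declared_types(item)
    return found


# Hypothetical page with a search box annotation (schema.org SearchAction).
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "http://schema.org",
 "@type": "WebSite",
 "url": "http://example.org/",
 "potentialAction": {
   "@type": "SearchAction",
   "target": "http://example.org/search?q={query}",
   "query-input": "required name=query"}}
</script>
</head><body></body></html>
"""

extractor = JsonLdExtractor()
extractor.feed(SAMPLE_HTML)
types = set()
for document in extractor.documents:
    types |= declared_types(document)
print(sorted(types))
```

The sketch only reads "@type" keys as written in the page; it does not expand them against the "@context", which a full JSON-LD processor would do.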


Download:

The overall size of the November 2015 RDFa, Microdata, Embedded
JSON-LD and Microformat datasets is 24.4 billion RDF quads.
For download, we split the data into 3,961 files with a total size of 404 GB.

http://webdatacommons.org/structureddata/2015-11/stats/how_to_get_the_data.html
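As a rough sketch of how one of the downloaded files could be inspected, the following Python snippet counts quads per host name of the page they were extracted from. It assumes that the files are gzip-compressed N-Quads and that the fourth term of every line is the URL of the source page; please check the download page for the actual format. The function count_quads_per_host and the sample data are hypothetical, and plain host names are used instead of pay-level domains, since the latter would require a public suffix list.

```python
import gzip
import io
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches an IRI in angle brackets directly before the terminating " ."
# (assumed here to be the graph label, i.e. the URL of the source page).
GRAPH_AT_END = re.compile(r"<([^<>]*)>\s*\.\s*$")


def count_quads_per_host(lines):
    """Count quads per host of the graph IRI, reading one N-Quads line at a time."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = GRAPH_AT_END.search(line)
        if match is None:
            continue  # line without a trailing IRI; ignored in this sketch
        host = urlsplit(match.group(1)).hostname
        if host:
            counts[host] += 1
    return counts


# Hypothetical sample data, compressed in memory to mimic a downloaded file.
SAMPLE = (
    '_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
    '<http://schema.org/Product> <http://shop.example.com/item/1> .\n'
    '_:b0 <http://schema.org/name> "Blue kettle" <http://shop.example.com/item/1> .\n'
    '_:b1 <http://schema.org/name> "Alice" <http://blog.example.org/about> .\n'
)
compressed = gzip.compress(SAMPLE.encode("utf-8"))

with gzip.open(io.BytesIO(compressed), mode="rt", encoding="utf-8") as handle:
    counts = count_quads_per_host(handle)

print(counts.most_common())
```

For a real file, the in-memory buffer would be replaced by the path of the downloaded file in the call to gzip.open. Reading line by line keeps memory use small, which matters given the overall size of the data set.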

In addition, we have created separate files for over 50 different
schema.org <http://schema.org/> classes, each including all quads
from pages that deploy the specific class at least once.

http://webdatacommons.org/structureddata/2015-11/stats/schema_org_subsets.html
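The selection rule for the class-specific files (keep all quads of a page if the page uses the class at least once) can be sketched as follows. This is only an illustration of the rule as described above, not the code used to build the subsets; the function subset_for_class and the sample quads are hypothetical, quads are assumed to be already parsed into (subject, predicate, object, page URL) tuples, and everything is held in memory for simplicity.

```python
from collections import defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def subset_for_class(quads, class_iri):
    """Return all quads from pages on which class_iri is used as rdf:type at least once."""
    # Group quads by the page (fourth term) they were extracted from.
    by_page = defaultdict(list)
    for quad in quads:
        by_page[quad[3]].append(quad)

    selected = []
    for page_quads in by_page.values():
        uses_class = any(
            predicate == RDF_TYPE and obj == class_iri
            for _, predicate, obj, _ in page_quads
        )
        if uses_class:
            selected.extend(page_quads)  # keep the whole page, not just typed nodes
    return selected


# Hypothetical sample: one product page and one page about a person.
QUADS = [
    ("_:b0", RDF_TYPE, "http://schema.org/Product", "http://shop.example.com/item/1"),
    ("_:b0", "http://schema.org/name", "Blue kettle", "http://shop.example.com/item/1"),
    ("_:b1", RDF_TYPE, "http://schema.org/Person", "http://blog.example.org/about"),
    ("_:b1", "http://schema.org/name", "Alice", "http://blog.example.org/about"),
]

product_subset = subset_for_class(QUADS, "http://schema.org/Product")
print(len(product_subset))
```

Note that the name quad of the product page is kept as well, although it does not mention the class itself, because the rule works on whole pages.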


Lots of thanks to:

+ the Common Crawl project for providing their great web crawl and thus
enabling the Web Data Commons project.
+ the Any23 project for providing their great library of structured data
parsers.
+ Amazon Web Services in Education Grant for supporting WebDataCommons.


Have fun with the new data set.

Cheers,
Robert Meusel and Christian Bizer

Received on Monday, 25 April 2016 13:33:09 UTC