[ANN] WebDataCommons releases 82.1 billion quads of Microdata, Embedded JSON-LD, RDFa, and Microformat data originating from 14.6 million websites

From: Anna Primpeli <anna@informatik.uni-mannheim.de>
Date: Tue, 11 Jan 2022 11:33:49 +0100
To: <semantic-web@w3.org>, <public-schemaorg@w3.org>, <public-vocabs@w3.org>
Message-ID: <002a01d806d6$b862f290$2928d7b0$@informatik.uni-mannheim.de>

Hi all,

we are happy to announce the new release of the WebDataCommons Microdata,
JSON-LD, RDFa and Microformat data corpus.

The data has been extracted from the October 2021 version of the Common
Crawl covering 3.2 billion HTML pages which originate from 35.4 million
websites (pay-level domains). 

In summary, we found structured data within 1.5 billion HTML pages out of
the 3.2 billion pages contained in the crawl (47.4%). These pages originate
from 14.6 million different pay-level domains out of the 35.4 million
pay-level-domains covered by the crawl (41.1%). 

Approximately 8.3 million websites provide structured data using the JSON-LD
syntax, 7.8 million websites use the Microdata markup format to annotate
structured data within their pages, while less than one million websites
were found to use the RDFa markup format.


Statistics about the October 2021 Release:

Basic statistics about the October 2021 Microdata, JSON-LD, RDFa, and
Microformat data sets as well as the vocabularies that are used along with
each markup format are found at: 



Markup Format Adoption

The WebDataCommons project has been extracting structured data from the
Common Crawl yearly since 2010. The October 2021 release marks 11 years
of monitoring the adoption of structured data on the Web. This allows us to
spot trends concerning the adoption of different markup formats as well as
the usage of specific classes and properties, a short overview of which is
provided on the page below:


The first WDC release in 2010 revealed that only 5.7% of the examined
webpages contained structured data. In 2021, we found structured data within
47.4% of the examined webpages, indicating a huge growth in adoption over
the last decade. The two markup formats that saw the largest increase in
adoption are Microdata and JSON-LD. By 2021, Microdata and JSON-LD dominate
over RDFa and Microformats. More concretely, in the 2010 release Microdata
was found in less than 1% of the websites containing structured data, while
in the 2021 release its relative adoption is more than 53%. JSON-LD has been
monitored by the WebDataCommons project since 2015 and was initially found
in 21% of the websites deploying markup annotations. In 2021, more than 57%
of the websites were found to use this markup format, which makes JSON-LD
the most widely adopted markup format. In contrast, the relative adoption of
RDFa and Microformats (hCard) has decreased over the last decade from 22%
and 66% to 4.9% and 28.5%, respectively.

Looking at the richness of the Microdata and JSON-LD annotations, which we
can approximate by the average number of triples per webpage, we can see an
overall increasing trend for the Microdata format, with some small
fluctuations between the years. On average, we extracted 21 Microdata
triples from each webpage in 2010. The number of triples per page increased
to 38 in 2016 and then decreased slightly to 36 in 2021. The growth in the
richness of JSON-LD annotations is even more significant, with the average
number of triples per webpage increasing continuously from 10 in 2015 to 47
in 2021. This indicates that JSON-LD data provides a higher level of detail
than Microdata annotations.

The schema.org vocabulary remains the most popular in the context of
Microdata and JSON-LD. It is used for annotating navigation elements within
webpages, using classes such as BreadcrumbList, SearchAction and
SiteNavigationElement, as well as the main content of a page, using classes
like Product, LocalBusiness, and JobPosting. We observe a rapidly increasing
adoption of several content classes: over the past four years, the number of
websites providing Product annotations increased from 594K to 2.5M (334%
growth), the number of websites annotating LocalBusiness entities increased
from 386K to 727K (88% growth), while the adoption of the JobPosting class
increased from 7K websites to 43K (514% growth).
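
The growth figures above follow the usual relative-growth formula,
(new - old) / old. The short Python sketch below recomputes them from the
rounded website counts quoted in this mail. The helper name relative_growth
is hypothetical, and 727K is assumed as the final LocalBusiness count, since
that is the value consistent with the stated 88% growth. Because the inputs
are rounded, the results only approximate the reported percentages (most
visibly for Product).

```python
def relative_growth(old: float, new: float) -> float:
    """Return the growth from old to new as a percentage of old."""
    if old <= 0:
        raise ValueError("old must be positive")
    return (new - old) / old * 100.0


# Rounded website counts quoted in the announcement (four years apart).
# The LocalBusiness end value of 727K is an assumption (see the lead-in).
counts = {
    "Product": (594_000, 2_500_000),
    "LocalBusiness": (386_000, 727_000),
    "JobPosting": (7_000, 43_000),
}

for cls, (old, new) in counts.items():
    print(f"{cls}: {relative_growth(old, new):.0f}% growth")
```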

Finally, we observe that an increasing number of websites explicitly
annotate entity identifiers, such as product identifiers, as well as other
identifying attributes such as telephone numbers or geo coordinates for
local businesses. Schema.org provides different terms for annotating
different types of product identifiers, with schema:Product/sku being the
most popular among them. Over the past four years, the relative adoption of
the schema:Product/sku property has increased from 21% to 55%. The
properties schema:LocalBusiness/telephone and schema:LocalBusiness/geo have
also seen growing adoption over the last four years, from 64% to 76% and
from 6% to 22.5%, respectively. This supports our previous observation on
the increasing richness of the annotations.



The overall size of the October 2021 RDFa, Microdata, Embedded JSON-LD and
Microformat data sets is 82.1 billion RDF quads. For download, we split the
data into 21,346 files with a total size of 1.6 TB.
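
For readers who want a first look at a downloaded file, the following Python
sketch counts how often each predicate occurs in it. It assumes the files are
gzip-compressed N-Quads with one quad per line, which is not stated in this
mail. The function names are hypothetical, and the regular expression is a
simplification of the N-Quads grammar rather than a complete parser.

```python
import gzip
import re
from collections import Counter
from typing import Iterable, Iterator, Optional, Tuple

Quad = Tuple[str, str, str, str]

# Simplified term pattern: IRI, blank node, or literal with an optional
# language tag or datatype. This is not the full N-Quads grammar.
TERM = r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'
QUAD_RE = re.compile(rf"^\s*{TERM}\s+{TERM}\s+{TERM}\s+{TERM}\s*\.\s*$")


def parse_quad(line: str) -> Optional[Quad]:
    """Split one line into (subject, predicate, object, graph), or None."""
    match = QUAD_RE.match(line)
    return match.groups() if match else None


def read_lines(path: str) -> Iterator[str]:
    """Stream the lines of a gzip-compressed text file (assumed format)."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        yield from handle


def count_predicates(lines: Iterable[str]) -> Counter:
    """Count predicate occurrences, skipping lines that do not parse."""
    counts: Counter = Counter()
    for line in lines:
        quad = parse_quad(line)
        if quad is not None:
            counts[quad[1]] += 1
    return counts
```

With a local file at a hypothetical path such as "part_0.gz", calling
count_predicates(read_lines("part_0.gz")).most_common(10) would list the ten
most frequent predicates in that file.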


In addition, we have created separate files for 44 different schema.org
<http://schema.org/> classes, each including all quads extracted from pages
that use the specific schema.org class.



Many thanks to:

+ the Common Crawl project for providing their great web crawl and thus
enabling the WebDataCommons project. 
+ the Any23 project for providing and maintaining their great library of
structured data parsers. 
+ the Amazon Web Services in Education Grant for supporting WebDataCommons.

General Information about the WebDataCommons Project

Since 2010, the WebDataCommons project has extracted structured data every
year from the Common Crawl, the largest web corpus available to the public, and
provides the extracted data for public download in order to support
researchers and companies in exploiting the wealth of information that is
available on the Web. Besides the yearly extractions of semantic annotations
from webpages, the WebDataCommons project provides large hyperlink graphs,
the largest public corpus of web tables, two corpora of product data, as
well as a collection of hypernyms extracted from billions of web pages for
public download. General information about the WebDataCommons project is
found at 


Have fun with the new data set. 

Anna Primpeli, Alexander Brinkmann and Chris Bizer

