WebDataCommons releases 86.3 billion quads of Microdata, Embedded JSON-LD, RDFa, and Microformat data originating from 15.3 million websites

Hi all,

We are happy to announce the new release of the WebDataCommons Microdata,
JSON-LD, RDFa and Microformat data corpus.

The data has been extracted from the September 2020 version of the Common
Crawl covering 3.4 billion HTML pages which originate from 34.5 million
websites (pay-level domains). For the extraction of structured data, the
newest version (2.4) of the Any23 library was used.

In summary, we found structured data within 1.7 billion HTML pages out of
the 3.4 billion pages contained in the crawl (50%). These pages originate
from 15.3 million different pay-level domains out of the 34.5 million
pay-level domains covered by the crawl (44.3%). Last year, we found
structured data in only 37% of the pages and on 37.2% of the pay-level
domains.
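
The two shares above can be recomputed from the rounded counts given in this
mail. The following is only a quick sanity check, not part of the extraction
pipeline; the helper name share is our own choice for illustration.

```python
def share(part, whole):
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)

# Rounded figures quoted in this announcement.
pages_with_data = 1.7e9   # HTML pages containing structured data
pages_total = 3.4e9       # HTML pages in the September 2020 crawl
plds_with_data = 15.3e6   # pay-level domains containing structured data
plds_total = 34.5e6       # pay-level domains covered by the crawl

page_share = share(pages_with_data, pages_total)
pld_share = share(plds_with_data, plds_total)

print(f"pages with structured data: {page_share}%")
print(f"pay-level domains with structured data: {pld_share}%")
```

Because the inputs are rounded, the result can differ slightly from a
computation on the exact counts on the statistics page.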

Approximately 7.8 million of the 2020 websites use Microdata, 7.6 million
websites use JSON-LD, and 3.3 million websites make use of RDFa.
Microformats are used by more than 4 million websites within the crawl.

 

Statistics about the December 2020 Release:

Basic statistics about the December 2020 Microdata, JSON-LD, RDFa, and
Microformat data sets as well as the vocabularies that are used together
with each markup format are found at: 

http://webdatacommons.org/structureddata/2020-12/stats/stats.html

 

Markup Format Adoption

The page below provides an overview of trends in the adoption of the
different markup formats as well as widely used schema.org classes in the
timespan 2012 to 2020:

http://webdatacommons.org/structureddata/#toc3 

Comparing the statistics from the new 2020 release to the statistics about
the 2019 release of the data sets

http://webdatacommons.org/structureddata/2019-12/stats/stats.html

we can observe that although the overall number of pages in the crawl is
38.9% larger than in the crawl used for the 2019 release, the corresponding
growth in terms of domains is only 7.9%, indicating that this year's crawl
corpus is much deeper than last year's. Nevertheless, more and more websites
annotate their content: the number of domains with annotated data grew by
more than 28% year over year. The markup format with the largest growth in
adopting domains (>50%) is JSON-LD. This trend is even more visible on
individual domains, such as hotels.com and yahoo.com, which have switched
from Microdata to JSON-LD as their dominant markup format.
Regarding vocabulary adoption, schema.org remains the dominant vocabulary.
In particular, the classes schema:WebPage, schema:Product, schema:Rating,
schema:Organization and schema:Person saw a major adoption increase in
comparison to 2019 (>40%). Looking at the richness of JSON-LD descriptions,
we notice that the average number of triples per URL has grown from 29 in
2019 to 41 in 2020 and has now reached a level of detail similar to that of
the Microdata annotations (avg 39 triples per URL).
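
To make the "deeper crawl" argument concrete, the sketch below combines the
two growth rates into an implied growth in pages per domain, and computes the
relative growth of the average JSON-LD triples per URL. It assumes that both
percentages are relative increases over the 2019 crawl and that "deeper" can
be read as "more pages per pay-level domain"; both are our interpretation,
and the variable names are illustrative only.

```python
def growth_pct(old, new):
    """Relative growth from old to new in percent, one decimal place."""
    return round(100.0 * (new - old) / old, 1)

# Growth rates quoted above (2020 crawl relative to the 2019 crawl).
page_growth = 38.9    # percent more pages
domain_growth = 7.9   # percent more pay-level domains

# If pages grow by a factor p and domains by a factor d, the average
# number of pages per domain grows by the factor p / d.
depth_ratio = (1 + page_growth / 100) / (1 + domain_growth / 100)
depth_growth = round(100.0 * (depth_ratio - 1), 1)

# Average pages per pay-level domain in the 2020 crawl (rounded inputs).
pages_per_pld_2020 = round(3.4e9 / 34.5e6, 1)

# Average JSON-LD triples per URL: 29 in 2019, 41 in 2020.
jsonld_growth = growth_pct(29, 41)

print(f"implied growth in pages per domain: {depth_growth}%")
print(f"average pages per domain in 2020: {pages_per_pld_2020}")
print(f"growth in JSON-LD triples per URL: {jsonld_growth}%")
```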

 

Download 

The overall size of the December 2020 RDFa, Microdata, Embedded JSON-LD and
Microformat data sets is 86.3 billion RDF quads. For download, we split the
data into 21,346 files with a total size of 1.9 TB.

http://webdatacommons.org/structureddata/2020-12/stats/how_to_get_the_data.html
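
As a starting point for working with the files, here is a minimal sketch that
counts quads per source page. It assumes that the files are gzip-compressed
N-Quads with one statement per line and that the fourth term (the graph
label) is the URL of the page the statement was extracted from; please check
the download page for the actual file format. The function names and the
sample lines are made up for illustration, and the line splitting is a
shortcut, not a full N-Quads parser.

```python
import gzip
from collections import Counter

def graph_of(line):
    """Return the graph IRI (last term) of one N-Quads line, or None.

    Assumes every statement carries a graph label; a plain triple whose
    object is an IRI would be misread, so this is not a full parser.
    """
    text = line.strip()
    if not text or text.startswith("#") or not text.endswith("."):
        return None
    body = text[:-1].rstrip()
    # IRIs contain no whitespace, so the graph label is the last token.
    last = body.rsplit(None, 1)[-1]
    if last.startswith("<") and last.endswith(">"):
        return last[1:-1]
    return None

def quads_per_page(lines):
    """Count statements per graph IRI over an iterable of lines."""
    counts = Counter()
    for line in lines:
        graph = graph_of(line)
        if graph is not None:
            counts[graph] += 1
    return counts

def quads_per_page_in_file(path):
    """Same count, reading a gzip-compressed file line by line."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return quads_per_page(handle)

# Made-up sample statements, not taken from the corpus.
sample = [
    '<http://example.org/a> <http://schema.org/name> "Widget A" <http://example.org/page1> .',
    '_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> <http://example.org/page1> .',
    '_:b1 <http://schema.org/name> "Hotel B"@en <http://example.org/page2> .',
]

counts = quads_per_page(sample)
average = sum(counts.values()) / len(counts)
print(counts)
print(f"average statements per page: {average}")
```

Reading line by line keeps memory use proportional to the number of distinct
pages rather than to the file size, which matters for files of this scale.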

In addition, we have created separate files for over 43 different
schema.org classes, each containing all quads extracted from pages that use
the specific schema.org class.

http://webdatacommons.org/structureddata/2020-12/stats/schema_org_subsets.html

 

Many thanks to:

+ the Common Crawl project for providing their great web crawl and thus
enabling the WebDataCommons project. 
+ the Any23 project for providing and maintaining their great library of
structured data parsers. 
+ Amazon Web Services in Education Grant for supporting WebDataCommons. 




General Information about the WebDataCommons Project

The WebDataCommons project has extracted structured data from the Common
Crawl, the largest web corpus available to the public, every year since
2012 and provides the extracted data for public download in order to support
researchers and companies in exploiting the wealth of information that is
available on the Web. Besides the yearly extractions of semantic
annotations from web pages, the WebDataCommons project also provides large
hyperlink graphs, the largest public corpus of web tables, two corpora of
product data, as well as a collection of hypernyms extracted from billions
of web pages for public download. General information about the
WebDataCommons project is found at

http://webdatacommons.org/


Have fun with the new data set. 



Cheers, 


Anna Primpeli, Alexander Brinkmann and Chris Bizer

Received on Thursday, 21 January 2021 09:58:41 UTC