Fwd: [ANN] WebDataCommons releases 82.1 billion quads Microdata, Embedded JSON-LD, RDFa, and Microformat data originating from 14.6 million websites from Kingsley Idehen on 2022-01-12 (public-lod@w3.org from January 2022)

From: Kingsley Idehen <kidehen@openlinksw.com>
Date: Wed, 12 Jan 2022 11:01:01 -0500
To: "public-lod@w3.org" <public-lod@w3.org>
Message-ID: <746ff452-181f-a3d5-f8c2-cc552683d7d8@openlinksw.com>
-------- Forwarded Message --------

Subject:  [ANN] WebDataCommons releases 82.1 billion quads Microdata, 
Embedded JSON-LD, RDFa, and Microformat data originating from 14.6 
million websites
Resent-Date:  Tue, 11 Jan 2022 10:34:16 +0000
Resent-From:  public-vocabs@w3.org
Date:  Tue, 11 Jan 2022 11:33:49 +0100
From:  Anna Primpeli <anna@informatik.uni-mannheim.de>
To:  semantic-web@w3.org, public-schemaorg@w3.org, public-vocabs@w3.org



Hi all,

We are happy to announce the new release of the WebDataCommons 
Microdata, JSON-LD, RDFa and Microformat data corpus.

The data has been extracted from the October 2021 version of the Common 
Crawl covering 3.2 billion HTML pages which originate from 35.4 million 
websites (pay-level domains).

In summary, we found structured data within 1.5 billion HTML pages out 
of the 3.2 billion pages contained in the crawl (47.4%). These pages 
originate from 14.6 million different pay-level domains out of the 35.4 
million pay-level domains covered by the crawl (41.1%).

Approximately 8.3 million websites provide structured data using the 
JSON-LD syntax, 7.8 million websites use the Microdata markup format to 
annotate structured data within their pages, while less than one million 
websites were found to use the RDFa markup format.
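As a rough cross-check, the per-format website counts above can be turned into relative adoption shares. The sketch below uses only the rounded figures quoted in this mail, so its results can differ slightly from the percentages derived from the exact counts; the constant and function names are made up for illustration.

```python
# Rounded counts quoted in the announcement, in millions of websites
# (pay-level domains). Exact counts are on the WDC statistics page.
WEBSITES_WITH_STRUCTURED_DATA = 14.6
WEBSITES_BY_FORMAT = {"JSON-LD": 8.3, "Microdata": 7.8}


def relative_adoption(format_count: float, total: float) -> float:
    """Percentage of structured-data websites that use one markup format."""
    return 100.0 * format_count / total


shares = {
    name: relative_adoption(count, WEBSITES_WITH_STRUCTURED_DATA)
    for name, count in WEBSITES_BY_FORMAT.items()
}

for name, share in shares.items():
    print(f"{name}: {share:.1f}% of websites with structured data")
```

The shares add up to more than 100% because a single website can use several markup formats at once.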

*Statistics about the October 2021 Release:*

Basic statistics about the October 2021 Microdata, JSON-LD, RDFa, and 
Microformat data sets as well as the vocabularies that are used along 
with each markup format are found at:

http://webdatacommons.org/structureddata/2021-12/stats/stats.html

*Markup Format Adoption*

The WebDataCommons project has been extracting structured data from the 
Common Crawl yearly since 2010. The October 2021 release signifies 11 
years of monitoring the adoption of structured data on the Web. This 
allows us to spot trends concerning the adoption of different markup 
formats as well as the usage of specific classes and properties, a short 
overview of which is provided on the page below:

http://webdatacommons.org/structureddata/#toc3

The first WDC release in 2010 revealed that only 5.7% of the examined 
webpages contained structured data. In 2021, we found structured data 
within 47.4% of the examined webpages indicating a huge growth in 
adoption over the last decade. The two markup formats that saw the 
largest increase in adoption are Microdata and JSON-LD. By 2021, 
Microdata and JSON-LD dominate over RDFa and Microformats. More 
concretely, in the 2010 release Microdata was found in less than 1% 
of the websites containing structured data, while in the newest 2021 
release its relative adoption is more than 53%. JSON-LD has been 
monitored by the WebDataCommons project since 2015 and was initially 
found in 21% of the websites deploying markup annotations. In 2021 more 
than 57% of the websites were found to use this markup format, which 
makes JSON-LD the most widely adopted markup format. In contrast, the 
relative adoption of RDFa and Microformats (hCard) has decreased over 
the last decade from 22% and 66% to 4.9% and 28.5%, respectively.

Looking at the richness of the Microdata and JSON-LD annotations, which 
we approximate by the average number of triples per webpage, we see an 
overall increasing trend with some small fluctuations between the years 
for the Microdata format. On average, we extracted 21 Microdata triples 
from each webpage in 2010. The number of triples per page increased to 
38 in 2016, followed by a slight decrease to 36 triples per webpage in 
2021. The growth in richness of the JSON-LD annotations is even more 
pronounced, with the average number of triples per webpage continuously 
increasing from 10 in 2015 to 47 in 2021. This indicates that JSON-LD 
data provides a higher level of detail in comparison to Microdata 
annotations.

The schema.org vocabulary remains the most popular in the context of 
Microdata and JSON-LD. It is used for annotating navigation elements 
within webpages, using classes such as /BreadcrumbList/, /SearchAction/ 
and /SiteNavigationElement/, as well as the main content of a page, 
using classes like /Product/, /LocalBusiness/, and /JobPosting/. We 
observe a rapidly increasing adoption of several content classes: Over 
the past four years the number of websites providing Product annotations 
increased from 594K to 2.5M (334% growth), the number of websites 
annotating LocalBusiness entities increased from 386K to 727K (88% 
growth), while the adoption of the JobPosting class increased from 7K 
websites to 43K (514% growth).
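The growth figures in the previous paragraph follow the usual relative-change formula, (after - before) / before. A minimal sketch with the rounded website counts from this mail, taking the LocalBusiness end value as 727 thousand websites (the figure consistent with the stated 88% growth); the names are illustrative only:

```python
def growth_percent(before: float, after: float) -> float:
    """Relative growth from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


# Rounded website counts in thousands: (four years ago, 2021 release).
ADOPTION_IN_THOUSANDS = {
    "Product": (594, 2500),
    "LocalBusiness": (386, 727),
    "JobPosting": (7, 43),
}

growth = {
    name: growth_percent(before, after)
    for name, (before, after) in ADOPTION_IN_THOUSANDS.items()
}

for name, value in growth.items():
    print(f"{name}: {value:.1f}% growth")
```

With these rounded inputs the LocalBusiness and JobPosting results match the stated 88% and 514%. The Product result comes out near 321% rather than 334%, presumably because 594K and 2.5M are themselves rounded.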

Finally, we observe that an increasing number of websites explicitly 
annotate entity identifiers, such as product identifiers, as well as 
other identifying attributes such as telephone numbers or geo 
coordinates for local businesses. Schema.org provides different terms 
for annotating different types of product identifiers, with 
schema:Product/sku being the most popular among them. Over the past four 
years, the relative adoption of the schema:Product/sku property has 
increased from 21% to 55%. The adoption of the properties 
schema:LocalBusiness/telephone and schema:LocalBusiness/geo has also 
grown over the last four years, from 64% to 76% and from 6% to 22.5%, 
respectively. This supports our previous observation on the increasing 
richness of the annotations.


*Download*

The overall size of the October 2021 RDFa, Microdata, Embedded JSON-LD 
and Microformat data sets is 82.1 billion RDF quads. For download, we 
split the data into 21,346 files with a total size of 1.6 TB.

http://webdatacommons.org/structureddata/2021-12/stats/how_to_get_the_data.html
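For readers who want to take a first look at a downloaded file, here is a minimal sketch of how quads could be counted per host. It rests on assumptions that are not stated in this mail: that each file is gzip-compressed N-Quads with one quad per line, and that the fourth element (the graph label) is the URL of the page the quad was extracted from. The regular expression is a simplification of the N-Quads grammar, and all function names are hypothetical.

```python
import gzip
import re
from collections import Counter
from urllib.parse import urlparse

# Simplified N-Quads line pattern (not the full W3C grammar):
# subject, predicate, object, graph label, terminating dot.
QUAD = re.compile(
    r'^\s*(<[^>]*>|_:\S+)\s+'   # subject: IRI or blank node
    r'(<[^>]*>)\s+'             # predicate: IRI
    r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)\s+'  # object
    r'(<[^>]*>)\s*\.\s*$'       # graph label (assumed: source page URL)
)


def parse_quad(line):
    """Return (subject, predicate, object, graph) or None if the line does not match."""
    match = QUAD.match(line)
    return match.groups() if match else None


def count_quads_per_host(lines):
    """Count quads per host name taken from the graph label of each quad."""
    hosts = Counter()
    for line in lines:
        quad = parse_quad(line)
        if quad is None:
            continue  # skip lines the simplified pattern cannot handle
        graph_iri = quad[3][1:-1]  # strip the angle brackets
        hosts[urlparse(graph_iri).netloc] += 1
    return hosts


def count_quads_per_host_in_file(path):
    """Same as above for a gzip-compressed file (assumed file format)."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return count_quads_per_host(handle)


# Tiny in-memory sample so the sketch runs without any download.
SAMPLE = [
    '_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
    '<http://schema.org/Product> <https://shop.example.com/item/1> .',
    '_:b0 <http://schema.org/sku> "A-42" <https://shop.example.com/item/1> .',
    'this line is not a quad',
]
sample_counts = count_quads_per_host(SAMPLE)
print(sample_counts)
```

Note that this groups by host name, not by pay-level domain; mapping hosts to pay-level domains would additionally need a public-suffix list.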

In addition, we have created separate files for 44 different schema.org 
<http://schema.org/> classes, each including all quads extracted from 
pages that use the respective schema.org class.

http://webdatacommons.org/structureddata/2021-12/stats/schema_org_subsets.html


*Lots of thanks to:*

+ the Common Crawl project for providing their great web crawl and 
thus enabling the WebDataCommons project.
+ the Any23 project for providing and maintaining their great library of 
structured data parsers.
+ Amazon Web Services in Education Grant for supporting WebDataCommons.



*General Information about the WebDataCommons Project*

Since 2010 the WebDataCommons project has yearly extracted structured 
data from the Common Crawl, the largest web corpus available to the 
public, and provides the extracted data for public download in order to 
support researchers and companies in exploiting the wealth of 
information that is available on the Web. Besides the yearly extractions 
of semantic annotations from webpages, the WebDataCommons project 
provides large hyperlink graphs, the largest public corpus of web 
tables, two corpora of product data, as well as a collection of 
hypernyms extracted from billions of web pages for public download. 
General information about the WebDataCommons project is found at

http://webdatacommons.org/


Have fun with the new data set.

Cheers,
Anna Primpeli, Alexander Brinkmann and Chris Bizer
Received on Wednesday, 12 January 2022 16:01:18 UTC