[ANN] WebDataCommons releases 86.4 billion quads Microdata, Embedded JSON-LD, RDFa, and Microformat data originating from 14.2 million websites

From: Alexander Brinkmann <alexander.brinkmann@uni-mannheim.de>
Date: Wed, 25 Jan 2023 12:41:22 +0000
To: "public-schemaorg@w3.org" <public-schemaorg@w3.org>
Message-ID: <7328d2e73285498a8492588562a64095@uni-mannheim.de>

Hi all,

We are happy to announce the new release of the WebDataCommons Microdata, JSON-LD, RDFa and Microformat data corpus.

The data has been extracted from the September/October 2022 version of the Common Crawl, covering 3.15 billion HTML pages which originate from 33.8 million websites (pay-level domains).

In summary, we found structured data within 1.5 billion HTML pages out of the 3.15 billion pages contained in the crawl (46.88%). These pages originate from 14.2 million different pay-level domains out of the 33.8 million pay-level domains covered by the crawl (42.01%).

Approximately 8.6 million websites provide structured data using the JSON-LD syntax, 7.5 million websites use the Microdata markup format to annotate structured data within their pages, while half a million websites were found to use the RDFa markup format.

Statistics about the October 2022 Release:

Basic statistics about the October 2022 Microdata, JSON-LD, RDFa, and Microformat data sets as well as the vocabularies that are used along with each markup format are found at:

https://webdatacommons.org/structureddata/2022-12/stats/stats.html

Markup Format Adoption

The WebDataCommons project has been extracting structured data from the Common Crawl yearly since 2010. The October 2022 release signifies 12 years of monitoring the adoption of structured data on the Web. This allows us to spot trends concerning the adoption of different markup formats as well as the usage of specific classes and properties. A short overview of these trends is provided on the page below:

https://webdatacommons.org/structureddata/#toc3

The first WDC release in 2010 revealed that only 5.7% of the examined webpages contained structured data. In 2022, we found structured data within 46.9% of the examined webpages, indicating a huge growth in adoption over the last decade. The two markup formats that saw the largest increase in adoption are JSON-LD and Microdata. By 2022, JSON-LD and Microdata dominate over RDFa and Microformats. More concretely, in the 2010 release Microdata was found in less than 1% of the websites containing structured data, while in the newest 2022 release the relative adoption is more than 52%. JSON-LD has been monitored by the WebDataCommons project since 2015 and was initially found in 21% of the websites deploying markup annotations. In 2022, more than 60% of the websites were found to use this markup format, which makes JSON-LD the most widely adopted markup format. In contrast, the relative adoption of RDFa and Microformats (hCard) has decreased over the last decade from 22% and 66% to 4% and 27%, respectively.

Looking at the richness of the Microdata and JSON-LD annotations, which we can approximate by the average number of triples per webpage, we can see that there is an overall increasing trend with some small fluctuations between the years for the Microdata format. On average, we extracted 21 Microdata triples from each webpage in 2010. The number of triples per page increased to 38 in 2016, while there was a slight decrease to 36 triples per webpage in 2022. The growth of the richness of JSON-LD annotations is even more significant, with the average number of triples per webpage continuously increasing from 10 in 2015 to 52 in 2022. This indicates that JSON-LD data provides a higher level of detail in comparison to Microdata annotations.

The schema.org vocabulary remains the most popular in the context of Microdata and JSON-LD. It is used for annotating navigation elements within webpages, using classes such as BreadcrumbList, WebPage and SiteNavigationElement, as well as the main content of a page, using classes like Product, LocalBusiness, and JobPosting. We observe a rapidly increasing adoption of several content classes: over the past four years, the number of websites providing Product annotations increased from 594K to 2.6M (430% growth), the number of websites annotating LocalBusiness entities increased from 386K to 1.2M (310% growth), while the adoption of the JobPosting class increased from 7K websites to 50K (721% growth).
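As a rough check on how the growth figures relate to the website counts, the sketch below recomputes them from the rounded counts quoted above. It is an assumption, not something stated in this announcement, that the percentages express the 2022 count relative to the earlier count; because the inputs are rounded, the results differ slightly from the quoted figures.

```python
# Rounded website counts quoted in the text: (earlier count, 2022 count).
counts = {
    "Product": (594_000, 2_600_000),
    "LocalBusiness": (386_000, 1_200_000),
    "JobPosting": (7_000, 50_000),
}


def ratio_percent(old, new):
    """Newer count expressed as a percentage of the older count."""
    return 100.0 * new / old


def increase_percent(old, new):
    """Relative increase over the older count, in percent."""
    return 100.0 * (new - old) / old


for name, (old, new) in counts.items():
    print(
        f"{name}: ratio {ratio_percent(old, new):.0f}%, "
        f"increase {increase_percent(old, new):.0f}%"
    )
```

The two helper functions make the distinction explicit: a count that roughly quadruples is about 400% of the original but a 300% increase.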

Finally, we observe that an increasing number of websites explicitly annotate entity identifiers, such as product identifiers, as well as other identifying attributes such as telephone numbers or geo coordinates for local businesses. Schema.org provides different terms for annotating different types of product identifiers, with schema:Product/sku being the most popular among them. Over the past five years, the relative adoption of the schema:Product/sku property has increased from 21% to 66%. The properties schema:LocalBusiness/telephone and schema:LocalBusiness/geo have also seen growth in the last five years, from 64% to 76% and from 6% to 32%, respectively. This supports our previous observation on the increasing richness of the annotations.
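To illustrate what such identifying annotations can look like, here is a minimal sketch that builds two embedded JSON-LD snippets using the schema.org properties mentioned above (sku, telephone, geo). All names and values are hypothetical examples, and the helper to_script_tag is introduced only for illustration.

```python
import json

# Hypothetical example values; only the class and property names
# (Product, sku, LocalBusiness, telephone, geo, GeoCoordinates)
# come from the schema.org vocabulary.
product = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Example Widget",
    "sku": "EX-12345",
}

local_business = {
    "@context": "https://schema.org",
    "@type": "LocalBusiness",
    "name": "Example Bakery",
    "telephone": "+49 621 0000000",
    "geo": {
        "@type": "GeoCoordinates",
        "latitude": 49.4875,
        "longitude": 8.4660,
    },
}


def to_script_tag(entity):
    """Serialize an entity as an embedded JSON-LD <script> element."""
    body = json.dumps(entity, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


print(to_script_tag(product))
print(to_script_tag(local_business))
```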

Download

The overall size of the October 2022 RDFa, Microdata, Embedded JSON-LD and Microformat data sets is 86.4 billion RDF quads. For download, we split the data into 15,819 files with a total size of 1.6 TB.

http://webdatacommons.org/structureddata/2022-12/stats/how_to_get_the_data.html

In addition, we have created separate files for 44 different schema.org classes, each including all quads extracted from pages that use the specific schema.org class.

http://webdatacommons.org/structureddata/2022-12/stats/schema_org_subsets.html
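As a starting point for working with a downloaded file, the following sketch counts how often each class appears as the object of rdf:type. It assumes the files are gzip-compressed N-Quads with one quad per line; the regular expression is a simplified approximation of the N-Quads grammar rather than a full parser, and the file name in the usage note is hypothetical.

```python
import gzip
import re
from collections import Counter

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

# Simplified term pattern: IRI, blank node, or literal with an
# optional language tag or datatype. Not a complete N-Quads grammar.
TERM = r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'
QUAD = re.compile(rf"^\s*{TERM}\s+{TERM}\s+{TERM}\s+{TERM}\s*\.\s*$")


def parse_quad(line):
    """Return (subject, predicate, object, graph) or None if unparsable."""
    match = QUAD.match(line)
    return match.groups() if match else None


def count_types(lines):
    """Count the objects of rdf:type statements in an iterable of lines."""
    counts = Counter()
    for line in lines:
        quad = parse_quad(line)
        if quad is None:
            continue  # skip lines the simplified pattern cannot handle
        _subject, predicate, obj, _graph = quad
        if predicate == RDF_TYPE:
            counts[obj] += 1
    return counts


def count_types_in_file(path):
    """Stream a gzip-compressed N-Quads file without loading it fully."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return count_types(handle)
```

For example, count_types_in_file("part_0.gz").most_common(10) would list the ten most frequent classes in a hypothetical local file named part_0.gz. Streaming line by line keeps memory use low, which matters given the size of the individual files.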

Lots of thanks to:

+ the Common Crawl project for providing their great web crawl and thus enabling the WebDataCommons project.

+ the Any23 project for providing and maintaining their great library of structured data parsers.

General Information about the WebDataCommons Project

Since 2010, the WebDataCommons project has extracted structured data yearly from the Common Crawl, the largest web corpus available to the public, and provides the extracted data for public download in order to support researchers and companies in exploiting the wealth of information that is available on the Web. Besides the yearly extractions of semantic annotations from webpages, the WebDataCommons project provides large hyperlink graphs, the largest public corpus of web tables, two corpora of product data, as well as a collection of hypernyms extracted from billions of web pages for public download. General information about the WebDataCommons project is found at

https://webdatacommons.org/

Have fun with the new data set.

Cheers,

Alexander Brinkmann and Chris Bizer

Received on Thursday, 26 January 2023 09:19:30 UTC